
How machine learning conquers the unitary limit


Bastian Kaspschak1, Ulf-G. Meißner1,2,3
1Helmholtz-Institut für Strahlen- und Kernphysik and Bethe Center for Theoretical Physics, Universität Bonn, D-53115 Bonn, Germany
2Institute for Advanced Simulation, Institut für Kernphysik, and Jülich Center for Hadron Physics, Forschungszentrum Jülich, D-52425 Jülich, Germany
3Tbilisi State University, 0186 Tbilisi, Georgia

Author to whom any correspondence should be addressed.
Received: 2020-10-19; Revised: 2020-12-07; Accepted: 2020-12-10; Online: 2021-02-04


Abstract
Machine learning has become a premier tool in physics and other fields of science. It has been shown that the quantum mechanical scattering problem can not only be solved with such techniques, but it has also been argued that the underlying neural network develops the Born series for shallow potentials. However, classical machine learning algorithms fail in the unitary limit of an infinite scattering length. The unitary limit plays an important role in our understanding of bound strongly interacting fermionic systems and can be realized in cold atom experiments. Here, we develop a formalism that explains the unitary limit in terms of what we define as unitary limit surfaces. This not only allows one to investigate the unitary limit geometrically in potential space, but also provides a numerically simple approach towards unnaturally large scattering lengths with standard multilayer perceptrons. Its scope is therefore not limited to applications in nuclear and atomic physics, but includes all systems that exhibit an unnaturally large scale.
Keywords: unitary limit; machine learning; quantum physics


Cite this article
Bastian Kaspschak, Ulf-G. Meißner. How machine learning conquers the unitary limit. Communications in Theoretical Physics, 2021, 73(3): 035101. doi:10.1088/1572-9494/abd84d

1. Introduction

After neural networks have already been successfully used in experimental applications, such as particle identification, see e.g. [1], much progress has been made in recent years by applying them to various fields of theoretical physics, such as [2–14]. An interesting property of neural networks is that their prediction is achieved in terms of simple mathematical operations. A notable example is given by multilayer perceptrons (MLPs), see equation (4), which approximate continuously differentiable functions exclusively by matrix multiplications, additions and componentwise applied nonlinear functions. Therefore, a neural network approach bypasses the underlying mathematical framework of the respective theory and still provides satisfactory results.

Despite their excellent performance, a major drawback of many neural networks is their lack of interpretability, which is expressed by the term ‘black boxes’. However, there are methods to restore interpretability. A premier example is given in [10]: by investigating patterns in the networks’ weights, it is demonstrated that MLPs develop perturbation theory in terms of Born approximations to predict natural S-wave scattering lengths a0 for shallow potentials. Nevertheless, this approach fails for deeper potentials, especially if they give rise to zero-energy bound states and thereby to the unitary limit ${a}_{0}\to \infty $. The physical reason for this is that the unitary limit is a highly non-perturbative scenario. In addition, the technical difficulty of reproducing singularities with neural networks emerges, which requires unconventional architectures. Note that in its initial formulation, the Bertsch problem for the unitary Fermi gas includes a vanishing effective range ${r}_{0}\to 0$ as an additional requirement for defining the unitary limit, see e.g. [15]. However, in the real physical systems on which we focus in this work, r0 is non-zero and finite for ${a}_{0}\to \infty $, that is, it violates scale invariance. Therefore, the case ${a}_{0}\to \infty $ that we consider as the unitary limit is independent of the effective range and, thus, less restrictive than the definition in the Bertsch problem. The unitary limit plays an important role in our understanding of bound strongly interacting fermionic systems [16–21] and can be realized in cold atom experiments, see, e.g. [22]. Therefore, the question arises how to deal with such a scenario in terms of machine learning. Our idea is to explain the unitary limit as a movable singularity in potential space. This formalism introduces two geometric quantities f and b0 that are regular for ${a}_{0}\to \infty $ and, therefore, can be easily approached by standard MLPs. Finally, natural and unnatural scattering lengths are predicted with sufficient accuracy by composing the respective networks.

The manuscript is organized as follows: in section 2, we introduce the concept of unitary limit surfaces and define a scale factor f that allows one to describe potentials around the first unitary limit surface. In section 3 we determine this factor with an ensemble of MLPs, which is followed in section 4 by the determination of the scattering length a0 in the vicinity of the first unitary limit surface. We pick up the issue of interpretability in section 5, using the Taylor expansion around a suitably chosen point on the first unitary limit surface. We end with further discussions and an outlook in section 6. Various technicalities are relegated to the appendices.

2. Discretized potentials and unitary limit surfaces

As we investigate the unitary limit, we only consider attractive potentials. For simplicity, the following analysis is restricted to non-positive, spherically symmetric potentials V(r)≤0 with finite range ρ. Together with the reduced mass μ, the latter parameterizes all dimensionless quantities. The most relevant for describing low-energy scattering processes turn out to be the dimensionless potential U=−2μρ2V≥0 and the S-wave scattering length a0. An important first step is to discretize potentials, since these can then be treated as vectors ${\boldsymbol{U}}$ ∈Ω⊂${{\mathbb{R}}}^{d}$ with non-negative components Un=U(n/d)≥0 and become processable by common neural networks. We associate the piecewise constant step potential$\begin{eqnarray}U(r)=\left\{\begin{array}{lll}{U}_{1} & \mathrm{if}\ & {r}_{0}\leqslant r\lt {r}_{1},\\ & \vdots & \\ {U}_{n} & \mathrm{if} & {r}_{n-1}\leqslant r\lt {r}_{n},\\ & \vdots & \\ 0 & \mathrm{if} & {r}_{d}\leqslant r\lt {r}_{d+1},\end{array}\right.\end{eqnarray}$where rn=n/d for n=0, …, d and ${r}_{d+1}=\infty $ are the transition points of the given potential, with a real vector ${\boldsymbol{U}}\in {\rm{\Omega }}$. The components Un of this vector correspond to the individual values the step potential in equation (1) takes between the (n − 1)th and the nth transition point. In the following, we therefore refer to vectors ${\boldsymbol{U}}\in {\rm{\Omega }}$ as discretized potentials. The degree d of discretization thereby controls the granularity and corresponds to the inverse step size of the emerging discretized potentials. Choosing it sufficiently large ensures that the entailed discretization error, e.g. on the scattering length, becomes insignificantly small. For instance, taking d=64 as in the following analysis reproduces the basic characteristics of all considered potentials and satisfies that requirement.
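To make the discretization concrete, the following minimal Python sketch (function and variable names are ours) samples a given dimensionless potential profile at the points r=n/d and returns the corresponding vector of step heights Un of equation (1):

import numpy as np

def discretize(U_profile, d=64):
    # Sample the dimensionless potential U(r) >= 0 on 0 <= r <= 1 at r = n/d, n = 1, ..., d,
    # which gives the step heights U_1, ..., U_d of equation (1).
    return np.array([U_profile(n / d) for n in range(1, d + 1)])

# Example: a potential well of constant depth u = 2 over the whole range
U_well = discretize(lambda r: 2.0)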

As a further result of discretization, the potential space is reduced to the first hyperoctant Ω of ${{\mathbb{R}}}^{d}$. Counting bound states naturally splits Ω=${\bigcup }_{i\in {{\mathbb{N}}}_{0}}$ Ωi into pairwise disjoint, half-open regions Ωi, with Ωi containing all potentials with exactly i bound states. All potentials on the (d−1)-dimensional hypersurface ${{\rm{\Sigma }}}_{i}\ \equiv \partial {{\rm{\Omega }}}_{i-1}\cap {{\rm{\Omega }}}_{i}$ between two neighboring regions with ${{\rm{\Sigma }}}_{i}\subset {{\rm{\Omega }}}_{i}$ give rise to a zero-energy bound state, see figure 1. Since we observe the unitary limit ${a}_{0}\,\to \infty $ in this scenario, we refer to Σi as the ith unitary limit surface. Considering the scattering length as a function ${a}_{0}:{\rm{\Omega }}\to {\mathbb{R}}$, this suggests a movable singularity on each unitary limit surface. For simplicity, we decide to focus on the first unitary limit surface Σ1, as this approach easily generalizes to higher order surfaces. Let ${\boldsymbol{U}}\in {\rm{\Omega }}$ and $f\in {{\mathbb{R}}}^{+}$ be a factor satisfying $f{\boldsymbol{U}}\in {{\rm{\Sigma }}}_{1}$. This means scaling ${\boldsymbol{U}}$ by the unique factor f yields a potential on the first unitary limit surface. While potentials with an empty spectrum must be deepened to obtain a zero-energy bound state, potentials whose spectrum already contains a dimer with finite binding energy E<0 need to be flattened instead. Accordingly, this behavior is reflected in the following inequalities:$\begin{eqnarray}f\,\,\left\{\begin{array}{ll}\gt 1 & \mathrm{if}\ {\boldsymbol{U}}\in {{\rm{\Omega }}}_{0},\\ =1 & \mathrm{if}\ {\boldsymbol{U}}\in {{\rm{\Sigma }}}_{1},\\ \lt 1 & \mathrm{else}.\end{array}\right.\end{eqnarray}$

Figure 1. Sketch of the regions Ω0 and Ω1 and the first unitary limit surface ${{\rm{\Sigma }}}_{1}\subset {{\rm{\Omega }}}_{1}$ for the degree d=2 of discretization. In this specific case, the potential space Ω is the first quadrant of ${{\mathbb{R}}}^{2}$ and unitary limit surfaces are one-dimensional manifolds.


3. Predicting f with an ensemble of MLPs

The factor f seems to be a powerful quantity for describing the geometry of the unitary limit surface Σ1. The latter is merely the contour for f=1. It is a simple task to derive f iteratively by scaling a given potential ${\boldsymbol{U}}$ until the scattering length flips sign, see appendix A. However, an analytic relation between ${\boldsymbol{U}}$ and f remains unknown to us. The remedy for this is provided by neural networks that are trained in a supervised manner on pairs $({\boldsymbol{U}},f)\in {T}_{1}$ of potentials (inputs) and corresponding factors (targets) in some training set T1. In this case, neural networks can be understood as maps ${ \mathcal F }:{\rm{\Omega }}\to {\mathbb{R}}$ that additionally depend on numerous internal parameters. The key idea of supervised training is to adapt the internal parameters iteratively such that the outputs ${ \mathcal F }({\boldsymbol{U}})$ approach the targets f ever more closely. As a result of training, ${ \mathcal F }$ approximates the underlying function ${\boldsymbol{U}}\ \mapsto f$, such that the factor ${f}^{* }\approx { \mathcal F }({{\boldsymbol{U}}}^{* })$ is predicted with sufficient accuracy even if the potential ${{\boldsymbol{U}}}^{* }\in {\rm{\Omega }}$ does not appear in T1, as long as it resembles the potentials encountered during training. This is also referred to as generalization. In order to measure the performance of ${ \mathcal F }$ on unknown data, one considers a test set T2 containing new pairs $({{\boldsymbol{U}}}^{* },{f}^{* })$ and the mean absolute percentage error (MAPE) on that set,$\begin{eqnarray}\mathrm{MAPE}=\displaystyle \frac{1}{| {T}_{2}| }\displaystyle \sum _{({{\boldsymbol{U}}}^{* },{f}^{* })\in {T}_{2}}\left|\displaystyle \frac{{ \mathcal F }({{\boldsymbol{U}}}^{* })-{f}^{* }}{{f}^{* }}\right|.\end{eqnarray}$By generating randomized inputs ${\boldsymbol{U}}$ via Gaussian random walks, we ensure that the training set covers a wide range of different potential shapes, see appendix A. This is extremely important, since we want to prevent the neural network from overfitting to specific potential shapes and instead want it to generalize optimally over the given region of interest around Σ1 in potential space.
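For reference, the MAPE of equation (3) is straightforward to evaluate; the following short Python function is a direct transcription and assumes that predictions and targets are passed as arrays of equal length:

import numpy as np

def mape(predictions, targets):
    # Mean absolute percentage error of equation (3), averaged over the test set.
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean(np.abs((predictions - targets) / targets))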

We decide to work with MLPs. These are a widespread and very common class of neural networks and provide an excellent performance for simpler problems. Here, an MLP ${{ \mathcal F }}_{i}$ with L layers is a composition$\begin{eqnarray}{{ \mathcal F }}_{i}={Y}_{L}\circ \ldots \circ {Y}_{1}\end{eqnarray}$of functions ${Y}_{j}:\,{V}_{j-1}\to {V}_{j}$. Usually we have ${V}_{j}={{\mathbb{R}}}^{{h}_{j}}$. While ${Y}_{1}:{\rm{\Omega }}\to {V}_{1}$ and ${Y}_{L}:{V}_{L-1}\to {\mathbb{R}}$ are called the input and output layers, respectively, each intermediate layer is referred to as a hidden layer. The layer Yj depends on a weight matrix ${W}_{j}\,\in \,{{\mathbb{R}}}^{{h}_{j}\times {h}_{j-1}}$ and a bias ${{\boldsymbol{b}}}_{j}\in {{\mathbb{R}}}^{{h}_{j}}$, both serving as internal parameters, and performs the operation$\begin{eqnarray}{Y}_{j}({\boldsymbol{v}})={a}_{j}({W}_{j}{\boldsymbol{v}}+{{\boldsymbol{b}}}_{j})\end{eqnarray}$on the vector ${\boldsymbol{v}}\in {V}_{j-1}$. The function ${a}_{j}:{\mathbb{R}}\to {\mathbb{R}}$ is called the activation function of the jth layer and is applied component-wise to vectors. Using nonlinear activation functions is crucial in order to make MLPs universal approximators. While output layers are classically activated via the identity, we activate all other layers via the continuously differentiable exponential linear unit (CELU) [23],$\begin{eqnarray}\mathrm{CELU}(v)=\max (0,v)+\min (0,\exp (v)-1).\end{eqnarray}$We use CELU because it is continuously differentiable, has bounded derivatives, allows positive and negative activations and bypasses the vanishing-gradient problem, which renders it very useful for deeper architectures. In order to achieve precise predictions of the factors f, we decide to train an ensemble of ${N}_{{ \mathcal F }}$=100 MLPs ${{ \mathcal F }}_{i}$, with each MLP consisting of nine CELU-activated 64×64 linear layers and one output layer. Thereby, an ensemble can be understood as a point cloud in weight space with the ith point representing ${{ \mathcal F }}_{i}$. We choose the output of the ensemble to be simply the mean of all individual outputs,$\begin{eqnarray}{ \mathcal F }({\boldsymbol{U}})=\displaystyle \frac{1}{{N}_{{ \mathcal F }}}\displaystyle \sum _{i=1}^{{N}_{{ \mathcal F }}}{{ \mathcal F }}_{i}({\boldsymbol{U}}).\end{eqnarray}$The training and test data sets contain $| {T}_{1}| =3\times {10}^{4}$ and $| {T}_{2}| =2.9\times {10}^{3}$ samples, respectively, with the degree d=64 of discretization, as described in appendix A. Positive and negative scattering lengths are nearly equally represented in each data set. After 20 epochs, that is after having scanned through the training set for the 20th time, the training procedure is terminated and the resulting MAPE of the ensemble ${ \mathcal F }$ turns out to be 0.028%. When plotting predictions versus targets, this implies a thin point cloud that is closely distributed around the bisector, as can be seen in figure 2. We therefore conclude that ${ \mathcal F }$ returns very precise predictions of f.
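A minimal PyTorch sketch of one ensemble member and of the ensemble average in equation (7) is given below. The layer count, layer width, CELU activation and member number follow the text; class and variable names are our own, and details such as data handling are omitted:

import torch
import torch.nn as nn

class FactorMLP(nn.Module):
    # One member F_i: nine CELU-activated 64x64 linear layers and a linear output layer.
    def __init__(self, d=64, hidden=9):
        super().__init__()
        layers = []
        for _ in range(hidden):
            layers += [nn.Linear(d, d), nn.CELU()]
        layers.append(nn.Linear(d, 1))  # output layer, activated via the identity
        self.net = nn.Sequential(*layers)

    def forward(self, U):
        return self.net(U).squeeze(-1)

class Ensemble(nn.Module):
    # Ensemble output as the mean of all individual outputs, equation (7).
    def __init__(self, members):
        super().__init__()
        self.members = nn.ModuleList(members)

    def forward(self, U):
        return torch.stack([m(U) for m in self.members]).mean(dim=0)

F_ensemble = Ensemble([FactorMLP() for _ in range(100)])  # N_F = 100 members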

Figure 2. Predictions ${ \mathcal F }({\boldsymbol{U}})$ of the scaling factor by the ensemble ${ \mathcal F }$ versus the targets f for all $({\boldsymbol{U}},f)\in {T}_{2}$. The resulting point cloud is very closely distributed around the bisector, which indicates an excellent performance of ${ \mathcal F }$ on the test set T2.


4. Predicting scattering lengths in the vicinity of Σ1

Our key motivation is to predict scattering lengths in the vicinity of Σ1. Being a movable singularity in potential space, the unitary limit itself imposes severe restrictions on MLP architectures and renders training steps unstable. Therefore, we opt for the alternative approach of expressing scattering lengths in terms of regular quantities that can each be easily predicted by MLPs. Given the factor f, we first consider$\begin{eqnarray}{b}_{0}={a}_{0}(1-f)=\displaystyle \frac{{a}_{0}x}{\parallel {\boldsymbol{U}}\parallel }.\,\end{eqnarray}$Note that $x=(1-f)\parallel {\boldsymbol{U}}\parallel $ is the distance between the given potential ${\boldsymbol{U}}\in {\rm{\Omega }}$ and $f\ {\boldsymbol{U}}\in {{\rm{\Sigma }}}_{1}$. For ${\boldsymbol{U}}\in {{\rm{\Omega }}}_{0}$, we observe x<0, whereas x=0 for ${\boldsymbol{U}}\in {{\rm{\Sigma }}}_{1}$ and x>0 else. The quantity b0 provides an equivalent understanding of this distance in terms of $x=({b}_{0}/{a}_{0})\parallel {\boldsymbol{U}}\parallel $, which does not explicitly depend on the factor f.

As shown in figure 3, b0 is finite and restricted to a small interval for all considered potentials. This does not imply that f and b0 are globally regular. Indeed, f diverges for ${\boldsymbol{U}}\to 0$ and b0 diverges on each higher order unitary limit surface. However, these two scenarios have no impact on our actual analysis.

Figure 3. b0 versus the corresponding factors f for all potentials ${\boldsymbol{U}}$ in the training set T1. Note that b0 is restricted to a small interval. The width of the point cloud suggests that there is no one-to-one relation between b0 and f.


Similar to the ensemble ${ \mathcal F }$ in equation (7), we train an ensemble ${ \mathcal B }$ of ${N}_{{ \mathcal B }}=100$ MLPs ${{ \mathcal B }}_{i}$. While the members ${{ \mathcal B }}_{i}$ consist of only five CELU-activated 64×64 linear layers and one output layer, the rest of the training procedure coincides with that presented in the previous section. The resulting MAPE of ${ \mathcal B }$ turns out to be 0.017%. Since all potentials in T1 are distributed around Σ1, this indicates that ${ \mathcal B }$ approximates the relation ${\boldsymbol{U}}\mapsto {b}_{0}$ very well in the vicinity of the first unitary limit surface.

Having trained the ensembles ${ \mathcal B }$ and ${ \mathcal F }$ to predict b0 and f precisely (see appendix B for details), we are able to construct scattering lengths according to equation (8): we expect the quotient$\begin{eqnarray}{ \mathcal A }({\boldsymbol{U}})=\displaystyle \frac{{ \mathcal B }({\boldsymbol{U}})}{1-{ \mathcal F }({\boldsymbol{U}})}\end{eqnarray}$to provide a good approximation of a0 for potentials in the vicinity of Σ1. This may appear counterintuitive at first, since the scattering length has a clear physics interpretation, whereas the quantities b0 and f refer to points and distances in potential space that are not themselves observable. However, requiring all potentials to be of finite range suffices for a unique effective range expansion and thus renders this approach model-independent. Figure 4(a) shows scattering lengths for potential wells that are predicted by ${ \mathcal A }$. As expected, we observe a singularity for the depth u=π2/4. In the same figure, this is compared to the case in which another ensemble ${ \mathcal A }$ of ten MLPs ${{ \mathcal A }}_{i}$ is given. These do not compute f and b0 as intermediate steps, but have been trained to predict a0 directly. However, since the ${{ \mathcal A }}_{i}$ are continuously differentiable, a singularity cannot be reproduced. This retrospectively justifies the proposed approach in equation (9).
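In code, the composition in equation (9) is a single line; the following sketch assumes trained ensemble callables, e.g. the F_ensemble from the sketch in section 3 and an analogously trained B_ensemble for b0:

def predict_a0(U, F_ensemble, B_ensemble):
    # Quotient of equation (9): reconstruct a0 from the regular quantities f and b0.
    return B_ensemble(U) / (1.0 - F_ensemble(U))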

Figure 4. (a) Predicted scattering lengths ${ \mathcal A }({\boldsymbol{U}})$ for potential wells with depths u with ${ \mathcal A }$ as given in equation (9) (solid red line) and, respectively, with an ensemble ${ \mathcal A }$ of ten MLPs ${{ \mathcal A }}_{i}$ that are trained to directly predict a0 (dashed black line). (b) Relative errors ${\varepsilon }_{{ \mathcal A }({\boldsymbol{U}})}$ of predicted scattering lengths ${ \mathcal A }({\boldsymbol{U}})$ for potential wells with depths u. The $| {\varepsilon }_{{ \mathcal A }({\boldsymbol{U}})}| $ only take values between 1% and 15% in close vicinity of the unitary limit and become negligibly small elsewhere.


Note that outputs ${ \mathcal A }({\boldsymbol{U}})$ for potentials in the unitary limit f→1 are very sensitive to ${ \mathcal F }({\boldsymbol{U}})$. In this regime, even the smallest errors may cause a large deviation from the target values and thereby corrupt the accuracy of ${ \mathcal A }$. In figure 4(b) we observe significantly larger relative errors ${\varepsilon }_{{ \mathcal A }({\boldsymbol{U}})}$ = $({ \mathcal A }({\boldsymbol{U}})$ −a0)/a0 in a small interval around the unitary limit at u=π2/4. Of course, this problem could be mitigated by a more accurate prediction of f, but the underlying difficulty is probably less machine-learning-related and is rather caused by working with large numbers. Nonetheless, the quotient ${ \mathcal A }$ reproduces the basic behavior of a0 sufficiently well for our purposes. Inspecting the prediction-versus-target plot for the test set T2 is another, more general and shape-independent way to convince ourselves of this, see figure 5. Although we notice a broadening of the point cloud for unnaturally large scattering lengths, the point cloud itself remains clearly distributed around the bisector. This implies that ${ \mathcal A }$ predicts natural and unnatural scattering lengths precisely enough and agrees with its relatively low MAPE. Indeed, the resulting MAPE of 0.41% indicates an overall good performance of ${ \mathcal A }$ on the test set T2.

Figure 5. Predictions ${ \mathcal A }({\boldsymbol{U}})$ of scattering lengths by the quotient ${ \mathcal A }$ versus the targets a0. The point cloud becomes broader for unnaturally large scattering lengths. Nonetheless it is still distributed sufficiently closely around the bisector, which indicates that ${ \mathcal A }$ generalizes well and reproduces the correct behavior of a0 around Σ1.


5. Taylor expansion for interpretability

Considering the quotient ${ \mathcal A }({\boldsymbol{U}})={ \mathcal B }({\boldsymbol{U}})/(1-{ \mathcal F }({\boldsymbol{U}}))$, we can make reliable predictions of natural and unnatural scattering lengths. We have established a geometrical understanding of the quantities f and b0, predicted by ${ \mathcal F }$ and ${ \mathcal B }$, respectively. However, since both ensembles are ‘black boxes’, their outputs and the outputs of ${ \mathcal A }$ are no longer interpretable beyond that level. One way to establish interpretability is to consider their Taylor expansions with respect to an appropriate expansion point ${{\boldsymbol{U}}}^{* }$. In the following, we demonstrate this for the ensemble ${ \mathcal F }$ with ${{\boldsymbol{U}}}^{* }$ on the first unitary limit surface: since ${ \mathcal F }$ is regular at ${{\boldsymbol{U}}}^{* }\in {{\rm{\Sigma }}}_{1}$, its Taylor series can be written as$\begin{eqnarray}{ \mathcal F }({\boldsymbol{U}})={ \mathcal F }({{\boldsymbol{U}}}^{* })+{\boldsymbol{n}}\cdot \delta {\boldsymbol{U}}+{ \mathcal O }({\boldsymbol{\delta }}{{\boldsymbol{U}}}^{2})\end{eqnarray}$with the displacement ${\boldsymbol{\delta }}{\boldsymbol{U}}={\boldsymbol{U}}-{{\boldsymbol{U}}}^{* }$ and the vector ${\boldsymbol{n}}$ with the components ${n}_{i}={\left.\partial { \mathcal F }/\partial {U}_{i}\right|}_{{\boldsymbol{U}}={{\boldsymbol{U}}}^{* }}$. For small displacements $\parallel {\boldsymbol{\delta }}{\boldsymbol{U}}\parallel \ll 1$ higher order terms in equation (10) become negligible. By construction we know that ${ \mathcal F }({{\boldsymbol{U}}}^{* })\approx 1$ for any ${{\boldsymbol{U}}}^{* }\in {{\rm{\Sigma }}}_{1}$. Note that ${\boldsymbol{n}}$ is the normal vector of the first unitary limit surface at the point ${{\boldsymbol{U}}}^{* }$. This is because ${ \mathcal F }({\boldsymbol{U}})$ is invariant under infinitesimal, orthogonal displacements ${\boldsymbol{\delta }}{\boldsymbol{U}}\perp {\boldsymbol{n}}$.

To give an example, we consider the first order Taylor approximation with respect to the potential well ${{\boldsymbol{U}}}^{* }=({\pi }^{2}/4)\left(1,\ldots ,1\right)$ in Σ1. At first we derive ${ \mathcal F }({{\boldsymbol{U}}}^{* })=1-3.19\times {10}^{-5}$. Due to the rather involved architecture of ${ \mathcal F }$, we decide to calculate derivatives numerically, that is ${n}_{i}\approx [{ \mathcal F }({{\boldsymbol{U}}}^{* }+{\rm{\Delta }}{{\boldsymbol{e}}}_{i})-{ \mathcal F }({{\boldsymbol{U}}}^{* })]/{\rm{\Delta }}$, with the ith basis vector ${{\boldsymbol{e}}}_{i}$ and the step size Δ=0.01. In figure 6 we can see that ${\boldsymbol{n}}$ is far from collinear to ${{\boldsymbol{U}}}^{* }$, which implies a complicated topology for Σ1. Using the expansion in equation (10) and the components ni shown in figure 6, we arrive at an interesting and interpretable approximation of ${ \mathcal A }({\boldsymbol{U}})$ around ${{\boldsymbol{U}}}^{* }$,$\begin{eqnarray}{ \mathcal A }({\boldsymbol{U}})\approx \displaystyle \frac{{ \mathcal B }({{\boldsymbol{U}}}^{* })}{1-{ \mathcal F }({{\boldsymbol{U}}}^{* })-{\boldsymbol{n}}\cdot \delta {\boldsymbol{U}}}.\end{eqnarray}$Equation (11) allows us to interpret the unitary limit locally in terms of a scalar product, which reproduces the tangent plane of Σ1 at the point ${{\boldsymbol{U}}}^{* }\in {{\rm{\Sigma }}}_{1}$. Successively inserting higher order terms of the Taylor series in equation (10) would correspondingly introduce curvature terms. Let us consider displacements ${\boldsymbol{\delta }}{\boldsymbol{U}}=(u-{\pi }^{2}/4)\left(1,\ldots ,1\right)$ that are parallel to ${{\boldsymbol{U}}}^{* }$, such that ${{\boldsymbol{U}}}^{* }+{\boldsymbol{\delta }}{\boldsymbol{U}}$ is a potential well with depth u. By inserting ${ \mathcal B }({{\boldsymbol{U}}}^{* })=0.81$ and ${\sum }_{i}{n}_{i}=-0.40$, equation (11) becomes$\begin{eqnarray}{ \mathcal A }({\boldsymbol{U}})\approx \displaystyle \frac{2.01}{u-({\pi }^{2}/4-7.92\times {10}^{-5})}.\end{eqnarray}$We can compare this to the expected behavior ${a}_{0}=1-\tan \sqrt{u}/\sqrt{u}$ of the S-wave scattering length for potential wells. The Padé approximant of order [0/1] of this function at u=π2/4 is given by a0≈$2/(u-{\pi }^{2}/4)$, which agrees with the approximation in equation (12) of the scattering lengths predicted by the quotient ${ \mathcal A }$ for inputs ${\boldsymbol{U}}=\,u\left(1,\ldots ,1\right)$ in the vicinity of ${{\boldsymbol{U}}}^{* }$. Using both approximations for ${ \mathcal A }({\boldsymbol{U}})$ and a0 we obtain$\begin{eqnarray}{\varepsilon }_{{ \mathcal A }({\boldsymbol{U}})}=\displaystyle \frac{5\left[u-({\pi }^{2}/4+1.58\times {10}^{-2})\right]}{u-({\pi }^{2}/4-7.92\times {10}^{-5})}\times {10}^{-3}\end{eqnarray}$as an estimate of the relative error. We convince ourselves that, up to a steep divergence for the depth $u={\pi }^{2}/4-7.92\times {10}^{-5}$ due to a minor deviation in the root of the denominator, the relative error ${\varepsilon }_{{ \mathcal A }({\boldsymbol{U}})}$ enters the per mille range, as we have already seen in figure 4(b).
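The numerical forward differences and the first-order approximation in equation (11) can be reproduced with the following sketch; F_ensemble again denotes a trained ensemble as in section 3, while the helper names and the torch-based evaluation are our own choices:

import numpy as np
import torch

def normal_vector(F_ensemble, U_star, step=0.01):
    # Forward differences n_i = [F(U* + step e_i) - F(U*)] / step of the ensemble at U*.
    U_star = torch.as_tensor(U_star, dtype=torch.float32)
    f_star = F_ensemble(U_star).item()
    n = np.zeros(U_star.numel())
    for i in range(U_star.numel()):
        shifted = U_star.clone()
        shifted[i] += step
        n[i] = (F_ensemble(shifted).item() - f_star) / step
    return f_star, n

def taylor_a0(U, U_star, f_star, n, b0_star):
    # First-order approximation of the predicted scattering length, equation (11).
    dU = np.asarray(U, dtype=float) - np.asarray(U_star, dtype=float)
    return b0_star / (1.0 - f_star - np.dot(n, dU))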

Figure 6. Components of the normal vector ${\boldsymbol{n}}$ of the unitary limit surface at ${{\boldsymbol{U}}}^{* }={\pi }^{2}/4\left(1,\ldots ,1\right)$. As a gradient of ${ \mathcal F }$, this vector points towards the strongest ascent of f, which explains why its components are negative.


6. Discussion and outlook

The unitary limit ${a}_{0}\to \infty $ is realized by movable singularities in potential space Ω, each corresponding to a hypersurface ${{\rm{\Sigma }}}_{i}\subset {\rm{\Omega }}$ that we refer to as the ith unitary limit surface. This formalism not only lets one understand the unitary limit in a geometric manner, but also introduces new quantities f and b0 that are related to the radial distance between the corresponding potential ${\boldsymbol{U}}$ and the first unitary limit surface Σ1. These are regular in the unitary limit and provide an alternative parameterization of low-energy scattering processes. As such, they suffice to derive the S-wave scattering length a0. By training ensembles of MLPs to predict f and b0, respectively, we therefore successfully establish a machine learning based description for unnatural as well as natural scattering lengths.

There is one major problem that remains unresolved by the presented approach: predictions ${ \mathcal A }({\boldsymbol{U}})$ of unnaturally large scattering lengths sensitively depend on the precision of the ensemble ${ \mathcal F }$. Minor errors in f cause the predicted first unitary limit surface to deviate slightly from the actual surface Σ1. In a very close neighborhood of the unitary limit, this generates diverging relative errors with respect to the true scattering lengths a0. As the predictions of ${ \mathcal F }$ will always be erroneous to a certain degree, this problem cannot be solved by optimizing the architecture or the training method. Instead, it is less machine-learning-related and rather originates in the handling of large numbers. However, the presented method involving f and b0 is still superior to more conventional and naive approaches like predicting the inverse scattering length 1/a0, which is obviously regular in the unitary limit, and considering the inverse prediction afterwards. Although the latter would provide a good estimate on unnatural scattering lengths, too, it would fail for a0≈0, whereas the divergence of f for extremely shallow potentials ${\boldsymbol{U}}\in {\rm{\Omega }}$ (that is $\parallel {\boldsymbol{U}}\parallel \ll 1$) can be easily factored out using ${f}_{{\boldsymbol{U}}}=\alpha {f}_{\alpha \cdot {\boldsymbol{U}}}$ for any $\alpha \in {{\mathbb{R}}}^{+}$, such that ${f}_{\alpha \cdot {\boldsymbol{U}}}$ is regular. For example, when choosing $\alpha =1/\parallel {\boldsymbol{U}}\parallel $, the factors ${f}_{{\boldsymbol{U}}/\parallel {\boldsymbol{U}}\parallel }$ correspond to the radial coordinate of Σ1 in the direction ${\boldsymbol{U}}/\parallel {\boldsymbol{U}}\parallel $ from the origin of Ω, which is clearly finite. Apart from the geometric interpretation of the unitary limit, which simply cannot be provided by classical approaches, we therefore conjecture that the proposed method offers the opportunity to determine extremely large and extremely small scattering lengths simultaneously with sufficient precision. Nevertheless, it is not a substitute for directly solving the Schrödinger equation. The asymptotics of its solutions are required in order to compute the effective range function and, finally, the scattering lengths. As targets, the latter are an important part of supervised learning and have to be determined before initiating the training procedure.

We recall that both ensembles emerge from training as ‘black boxes’. By considering their Taylor approximations, we can obtain an interpretable expression for the predicted scattering lengths ${ \mathcal A }({\boldsymbol{U}})$ in terms of a scalar product. This also provides additional geometric insights, such as normal vectors of the unitary limit surface.

Note that the presented approach is far more general than the above analysis of Σ1 suggests and, in fact, is a viable option whenever movable singularities come into play. First of all, we could have defined the quantities f and b0 with respect to any higher order unitary limit surface Σi with i>1, since we can apply the same geometric considerations as for Σ1. Adapting the training and test sets such that all potentials are distributed around Σi would allow us to train ${ \mathcal F }$ and ${ \mathcal B }$ to predict f and b0, respectively, which finally yields scattering lengths in the vicinity of Σi. This procedure is, however, not even limited to the description of scattering lengths and can be generalized to arbitrary effective range parameters, since these give rise to movable singularities as well. To give an example, we briefly consider the effective range r0, which diverges at the zero-crossing of a0. In analogy to the unitary limit surfaces Σi, we could therefore define the ith zero-crossing surface Σi′ as the (d−1)-dimensional manifold in Ωi that contains all potentials with i bound states and a vanishing scattering length ${a}_{0}\to 0$, such that $| {r}_{0}| \to \infty $. Here, we could define f′ by scaling potentials onto a particular surface, that is ${f}^{{\prime} }{\boldsymbol{U}}\in {{\rm{\Sigma }}}_{i}^{{\prime} }$ for all ${\boldsymbol{U}}\in {\rm{\Omega }}$, and subsequently ${b}_{0}^{{\prime} }={r}_{0}(1-{f}^{{\prime} })$. From this point, all further steps should be clear. Even beyond unitary limit and zero-crossing surfaces, analyzing how the effective range behaves under the presented scaling operation and interpreting the outputs of the corresponding neural networks seems to be an interesting further step in order to investigate effective range effects and deviations from exact scale invariance.

A downside of the presented method is that it is, as defined above, only capable of approaching one movable singularity Σi. In the case of scattering lengths, this is because b0 diverges at each other unitary limit surface ${{\rm{\Sigma }}}_{{i}^{{\prime} }}$ with ${i}^{{\prime} }\ne i$ due to the divergence of a0 and $f\not\approx 1$. Let us define ${{\rm{\Omega }}}_{i}^{{\prime} }$ as the subset of all potentials ${\boldsymbol{U}}\in {\rm{\Omega }}$ that are surrounded by the above mentioned zero-crossing surfaces ${{\rm{\Sigma }}}_{i}^{{\prime} }$ and ${{\rm{\Sigma }}}_{i+1}^{{\prime} }$, that is for all ${\boldsymbol{U}}\in {{\rm{\Omega }}}_{i}^{{\prime} }$ there are α<1 and β≥1, such that $\alpha {\boldsymbol{U}}\in {{\rm{\Sigma }}}_{i}^{{\prime} }$ and $\beta {\boldsymbol{U}}\in {{\rm{\Sigma }}}_{i+1}^{{\prime} }$. The problem can be solved by redefining f to scale potentials between two zero-crossing surfaces onto the enclosed unitary limit surface ${{\rm{\Sigma }}}_{i+1}$, that is $f{\boldsymbol{U}}\in {{\rm{\Sigma }}}_{i+1}$ for all ${\boldsymbol{U}}\in {{\rm{\Omega }}}_{i}^{{\prime} }$. As a consequence, f becomes discontinuous and behaves similarly to an inverted sawtooth function, which accordingly requires discontinuous activations in the MLPs ${{ \mathcal F }}_{i}$. Note that even after the redefinition, b0 remains continuous as it vanishes on the zero-crossing surfaces due to a0=0, which is exactly where the redefined f has a jump discontinuity.

The idea to study manifolds in potential space does not need to be restricted to movable singularities of effective range parameters, but can be generalized to arbitrary contours of low-energy variables. To give an example, consider the (d−1)-dimensional hypersurface ${{\rm{\Sigma }}}_{i}^{(B)}$ that consists of all discretized potentials ${\boldsymbol{U}}\in {\rm{\Omega }}$ that give rise to i bound states and whose shallowest bound state has the binding energy B. Note that for B=0, this exactly reproduces the ith unitary limit surface ${{\rm{\Sigma }}}_{i}={{\rm{\Sigma }}}_{i}^{(0)}$. In this case, the shallowest bound state is a zero-energy bound state. Otherwise, that is if $B\ne 0$, the scattering lengths of all points on ${{\rm{\Sigma }}}_{i}^{(B)}$ must be finite and, thus, there cannot be an overlap between ${{\rm{\Sigma }}}_{i}^{(B)}$ and any unitary limit surface Σj, such that ${{\rm{\Sigma }}}_{i}^{(B)}\cap {{\rm{\Sigma }}}_{j}=\varnothing $ for all $i,j\in {\mathbb{N}}$. Here, unitary limit surfaces mark important boundaries between scattering states and bound state spectra: by crossing a unitary limit surface, a scattering state undergoes dramatic changes to join the spectrum as a new, shallowest bound state. When analyzing a given system in a finite periodic box, instead, zero-energy bound states resemble deeper bound states much more closely. In this context we refer to lattice Monte Carlo simulations probing the unitary limit in a finite volume, see [24].

Appendix A. Preparation of data sets

As a consequence of discretization, the lth partial wave is defined piecewise: between the transition points rn−1 and rn it is given as a linear combination of spherical Bessel and Neumann functions,$\begin{eqnarray}{\phi }_{n}^{(l,k)}(r)={A}_{l,n}(k){{\rm{j}}}_{l}({k}_{n}r)-{B}_{l,n}(k){{\rm{n}}}_{l}({k}_{n}r),\end{eqnarray}$with the local momentum$\begin{eqnarray}{k}_{n}={{\rm{\Theta }}}_{\mathrm{Re}}(k)\sqrt{{k}^{2}+{U}_{n}}.\end{eqnarray}$Here we introduce the factor$\begin{eqnarray}{{\rm{\Theta }}}_{\mathrm{Re}}(k)=\left\{\begin{array}{ll}+1, & \mathrm{if}\ \mathrm{Re}(k)\geqslant 0\\ -1, & \mathrm{if}\ \mathrm{Re}(k)\lt 0\end{array}\right.\end{eqnarray}$to conserve the sign of k on the complex plane, that is, ${k}_{n}\to k$ if Un vanishes. The parameters ${A}_{l,d+1}(k)$ and ${B}_{l,d+1}(k)$ completely determine the effective range function ${K}_{l}(k)\,=\,{k}^{2l+1}\,\cot {\delta }_{l}(k)$ due to their asymptotic behavior$\begin{eqnarray}{A}_{l,d+1}(k)={{\rm{e}}}^{{\rm{i}}{\delta }_{l}(k)}\cos {\delta }_{l}(k),\end{eqnarray}$$\begin{eqnarray}{B}_{l,d+1}(k)={{\rm{e}}}^{{\rm{i}}{\delta }_{l}(k)}\sin {\delta }_{l}(k).\end{eqnarray}$Instead of solving the Schrödinger equation for the step potential U(r), we apply the transfer matrix method [25] to derive ${A}_{l,d+1}(k)$ and ${B}_{l,d+1}(k)$. Due to the smoothness of the partial wave ${\phi }^{(l,k)}$ at each transition point rn, this method allows us to relate ${A}_{l,d+1}(k)$ and ${B}_{l,d+1}(k)$ to the initial parameters ${A}_{l,1}(k)$ and ${B}_{l,1}(k)$ via a product of transfer matrices ${M}_{l,n}(k)$. To arrive at a representation of these transfer matrices, we split up the mentioned smoothness condition into two separate conditions for continuity,$\begin{eqnarray}{\phi }_{n+1}^{(l,k)}({r}_{n})={\phi }_{n}^{(l,k)}({r}_{n})\end{eqnarray}$and differentiability,$\begin{eqnarray}{\left.\displaystyle \frac{{\rm{d}}}{{\rm{d}}r}{\phi }_{n+1}^{(l,k)}(r)\right|}_{r={r}_{n}}={\left.\displaystyle \frac{{\rm{d}}}{{\rm{d}}r}{\phi }_{n}^{(l,k)}(r)\right|}_{r={r}_{n}}\end{eqnarray}$at each transition point rn.
Using equation (A1), we can combine both conditions (A6) and (A7) into a vector equation that connects neighboring coefficients with each other:$\begin{eqnarray}\begin{array}{rcl} & & \mathop{\underbrace{\left(\begin{array}{cc}{{\rm{j}}}_{l}({k}_{n+1}{r}_{n}) & -{{\rm{n}}}_{l}({k}_{n+1}{r}_{n})\\ {k}_{n+1}{{\rm{j}}}_{l}^{{\prime} }({k}_{n+1}{r}_{n}) & -{k}_{n+1}{{\rm{n}}}_{l}^{{\prime} }({k}_{n+1}{r}_{n})\end{array}\right)}}\limits_{={m}_{l,n+1}({r}_{n},k)}\left(\begin{array}{c}{A}_{l,n+1}(k)\\ {B}_{l,n+1}(k)\end{array}\right)\\ & & \qquad =\mathop{\underbrace{\left(\begin{array}{cc}{{\rm{j}}}_{l}({k}_{n}{r}_{n}) & -{{\rm{n}}}_{l}({k}_{n}{r}_{n})\\ {k}_{n}{{\rm{j}}}_{l}^{{\prime} }({k}_{n}{r}_{n}) & -{k}_{n}{{\rm{n}}}_{l}^{{\prime} }({k}_{n}{r}_{n})\end{array}\right)}}\limits_{={m}_{l,n}({r}_{n},k)}\left(\begin{array}{c}{A}_{l,n}(k)\\ {B}_{l,n}(k)\end{array}\right).\end{array}\end{eqnarray}$Multiplying equation (A8) with ${m}_{l,n+1}^{-1}({r}_{n},k)$ from the left yields$\begin{eqnarray}\left(\begin{array}{c}{A}_{l,n+1}(k)\\ {B}_{l,n+1}(k)\end{array}\right)={M}_{l,n}(k)\left(\begin{array}{c}{A}_{l,n}(k)\\ {B}_{l,n}(k)\end{array}\right),\end{eqnarray}$which defines the nth transfer matrix$\begin{eqnarray}{M}_{l,n}(k)={m}_{l,n+1}^{-1}({r}_{n},k){m}_{l,n}({r}_{n},k).\end{eqnarray}$Therefore, ${A}_{l,d+1}(k)$ and ${B}_{l,d+1}(k)$ are determined by the choice of ${A}_{l,1}(k)$ and ${B}_{l,1}(k)$, which requires us to define two boundary conditions. Due to the singularity of nl at the origin, the spherical Neumann contribution in the first layer must vanish and therefore ${B}_{l,1}(k)=0$. The choice of ${A}_{l,1}(k)$ may alter the normalization of the wave function. However, since we only consider ratios of ${A}_{l,d+1}(k)$ and ${B}_{l,d+1}(k)$, we may opt for ${A}_{l,1}(k)=1$, which corresponds to$\begin{eqnarray}{\phi }_{1}^{(l,k)}(r)={{\rm{j}}}_{l}({k}_{1}r).\end{eqnarray}$Finally, applying all transfer matrices successively to the initial parameters yields$\begin{eqnarray}\left(\begin{array}{c}{A}_{l,d+1}(k)\\ {B}_{l,d+1}(k)\end{array}\right)=\left(\prod _{n=1}^{d}{M}_{l,n}(k)\right)\ \left(\begin{array}{c}1\\ 0\end{array}\right).\end{eqnarray}$
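For the S-wave case l=0, the transfer matrix propagation of equations (A8)–(A12) can be implemented in a few lines; the sketch below uses the closed-form expressions for j0 and n0, accepts complex momenta, and is meant as an illustration under our own naming conventions rather than as the authors' implementation:

import numpy as np

# Closed-form S-wave (l = 0) spherical Bessel and Neumann functions and their derivatives.
# They accept complex arguments, as needed for the contour integration further below.
def j0(x):  return np.sin(x) / x
def n0(x):  return -np.cos(x) / x
def dj0(x): return (x * np.cos(x) - np.sin(x)) / x**2
def dn0(x): return (x * np.sin(x) + np.cos(x)) / x**2

def asymptotic_coefficients(U, k):
    # Propagate the boundary values (A_{0,1}, B_{0,1}) = (1, 0) of equation (A11) through
    # all transition points r_1, ..., r_d with the transfer matrices of equations (A8)-(A10).
    d = len(U)
    r = np.arange(1, d + 1) / d
    sign = 1.0 if np.real(k) >= 0 else -1.0                      # Theta_Re(k), equation (A3)
    k_in = sign * np.sqrt(k**2 + np.asarray(U, dtype=complex))   # k_1, ..., k_d, equation (A2)
    k_all = np.append(k_in, k)                                   # outside the range, k_{d+1} = k
    coeff = np.array([1.0, 0.0], dtype=complex)
    for n in range(d):
        kn, kn1, rn = k_all[n], k_all[n + 1], r[n]
        m_n  = np.array([[j0(kn * rn), -n0(kn * rn)],
                         [kn * dj0(kn * rn), -kn * dn0(kn * rn)]])
        m_n1 = np.array([[j0(kn1 * rn), -n0(kn1 * rn)],
                         [kn1 * dj0(kn1 * rn), -kn1 * dn0(kn1 * rn)]])
        coeff = np.linalg.solve(m_n1, m_n @ coeff)               # equation (A9)
    return coeff                                                 # (A_{0,d+1}(k), B_{0,d+1}(k))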

The most general way to derive any effective range expansion parameter ${Q}_{l}^{(j)}(\varkappa )$ for arbitrary expansion points $\varkappa \in {\mathbb{C}}$ in the complex momentum plane is a contour integration along a circular contour γ with radius κγ around ϰ. Applying Cauchy’s integral formula then yields$\begin{eqnarray}{Q}_{l}^{(j)}(\varkappa )=\displaystyle \frac{1}{2\pi i}{\oint }_{\gamma }{\rm{d}}k\,\displaystyle \frac{{k}^{2l+1}}{{\left(k-\varkappa \right)}^{j+1}}\displaystyle \frac{{A}_{l,d+1}(k)}{{B}_{l,d+1}(k)}.\end{eqnarray}$We approximate this integral numerically over N grid points$\begin{eqnarray}{k}_{q}=\varkappa +{\kappa }_{\gamma }{{\rm{e}}}^{{\rm{i}}{q}2\pi /N},\qquad q=0,\ldots ,N-1.\end{eqnarray}$Smaller contour radii κγ and larger N thereby produce finer grids and decrease the approximation error. This way of calculating ${Q}_{l}^{(j)}(\varkappa )$ requires in total d×N transfer matrices. The numerical integration provides$\begin{eqnarray}{Q}_{l}^{(j)}(\varkappa )\approx \displaystyle \frac{1}{N{\left({\kappa }_{\gamma }\right)}^{j}}\displaystyle \sum _{q=0}^{N-1}{\left({k}_{q}\right)}^{2l+1}{{\rm{e}}}^{-{\rm{i}}{q}\tfrac{2\pi }{N}}\displaystyle \frac{{A}_{l,d+1}({k}_{q})}{{B}_{l,d+1}({k}_{q})}.\end{eqnarray}$Despite the generality of equation (A15), we restrict this analysis to S-wave scattering lengths a0=$-1/{Q}_{0}^{(0)}(0)$, since these dominate low-energy scattering processes.
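Combining the propagated coefficients with the discretized contour integral of equation (A15) for l=0, j=0 and ϰ=0 then yields the scattering length; the following sketch reuses asymptotic_coefficients from above:

def scattering_length(U, N=100, kappa=0.1):
    # a0 = -1/Q_0^(0)(0) via the discretized contour integral of equation (A15),
    # with N grid points on a circle of radius kappa around the origin (equation (A14)).
    total = 0.0 + 0.0j
    for q in range(N):
        phase = np.exp(1j * 2 * np.pi * q / N)
        k_q = kappa * phase
        A, B = asymptotic_coefficients(U, k_q)
        total += k_q * A / (B * phase)       # (k_q)^{2l+1} e^{-i q 2 pi/N} A/B for l = 0
    Q = total / N                            # j = 0, so no additional kappa^j factor
    return float(np.real(-1.0 / Q))

As a rough check, for a unit-depth well, U = np.full(64, 1.0), this should approach a0 = 1 − tan(1) ≈ −0.557 from the exact formula quoted in section 5, up to discretization and quadrature errors.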

While generating the training and test sets, we must ensure that there are no overrepresented potential shapes in the respective data set. To maintain complexity, this suggests generating potentials with randomized components Un. An intuitive approach therefore is to produce them via Gaussian random walks: given d normally distributed random variables X1, …, Xd,$\begin{eqnarray}{X}_{i}\sim \left\{\begin{array}{ll}{ \mathcal N }(0,\mathrm{ISF}) & \mathrm{if}\ i=1,\\ { \mathcal N }(0,\mathrm{SF}) & \mathrm{else},\end{array}\right.\end{eqnarray}$where ${ \mathcal N }(\mu ,\sigma )$ describes a normal distribution with mean μ and standard deviation σ, the nth potential step Un is given by the magnitude of the sum over all previous steps Xi,$\begin{eqnarray}{U}_{n}=\left|\displaystyle \sum _{i=1}^{n}{X}_{i}\right|.\end{eqnarray}$Note that, while all steps Xi in equation (A16) have zero mean, the standard deviation of the first step, which we denote by the initial step factor ISF, may differ from the standard deviation of all other steps, which we refer to as the step factor SF. This allows us to roughly control the shapes and depths of all potentials in the data set. Choosing ISF≫SF makes it more likely to produce potentials that resemble a potential well and accordingly yield similar scattering lengths. In contrast to that, SF≫ISF produces strongly oscillating potentials. We decide to choose the middle course ISF=1 and SF=0.75 (this is the case SF ≈ ISF) for two reasons: for one, from the perspective of depths, the corresponding Gaussian random walk is capable of generating potentials around the first unitary limit surface Σ1. For another, this choice of step factors causes the data set to cover a wide range of shapes from dominantly potential wells to more oscillatory potentials, which is an important requirement for generalization. This way, we generate ${10}^{5}$ potentials for the training set and ${10}^{4}$ potentials for the test set. To avoid overfitting to a certain potential depth, this needs to be followed by a rigorous downsampling procedure. For this, we use the average depth$\begin{eqnarray}u=\displaystyle \frac{1}{d}\displaystyle \sum _{n=1}^{d}{U}_{n}\end{eqnarray}$as a measure. Uniformizing the training and test set with respect to $\sqrt{u}$ on the interval [1.10,1.87] by randomly omitting potentials with overrepresented depths finally yields a training set T1, see figure A1, and a test set T2 that contain $3\times {10}^{4}$ and $2.9\times {10}^{3}$ potentials, respectively. Scattering lengths are then derived using the numerical contour integration in equation (A15) with N=100 grid points and a contour radius of κγ=0.1. The derivation of the factors f is more involved: if the scattering length is negative (positive), the potential is iteratively scaled with the factor s=2 (s=1/2), until its scattering length changes its sign. Let us assume the potential has been scaled t times this way. Then we can specify the interval where we expect to find f as $({2}^{t-1},{2}^{t}]$ or $[{2}^{-t},{2}^{1-t})$, respectively. Cubically interpolating 1/a0 on that interval using 25 equidistant values and searching for its zero finally yields the desired factor f.
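A possible transcription of the data generation and of the determination of f reads as follows; it reuses the scattering_length helper from above and, as a simplification of ours, replaces the cubic interpolation of 1/a0 by a linear interpolation of its sign change:

def random_walk_potential(d=64, isf=1.0, sf=0.75, rng=None):
    # Gaussian random walk of equations (A16)-(A17): first step with standard deviation ISF,
    # all further steps with SF; U_n is the magnitude of the cumulative sum.
    rng = np.random.default_rng() if rng is None else rng
    steps = rng.normal(0.0, sf, size=d)
    steps[0] = rng.normal(0.0, isf)
    return np.abs(np.cumsum(steps))

def scale_factor(U, n_grid=25):
    # Bracket f by repeated scaling with s = 2 (a0 < 0) or s = 1/2 (a0 > 0) until the
    # scattering length flips sign, then interpolate 1/a0 on that interval and return its zero.
    s = 2.0 if scattering_length(U) < 0 else 0.5
    lo, hi = 1.0, s
    while np.sign(scattering_length(lo * U)) == np.sign(scattering_length(hi * U)):
        lo, hi = hi, hi * s
    grid = np.linspace(min(lo, hi), max(lo, hi), n_grid)
    inv_a0 = np.array([1.0 / scattering_length(g * U) for g in grid])
    i = int(np.where(np.diff(np.sign(inv_a0)) != 0)[0][0])
    return grid[i] - inv_a0[i] * (grid[i + 1] - grid[i]) / (inv_a0[i + 1] - inv_a0[i])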

Figure A1. (a) Distribution of the square root $\sqrt{u}$ of the average depth over the training set. By construction, this distribution is uniform. (b) Bimodal distribution of the scattering length a0 over the training set. Note that extremely large scattering lengths are not displayed in this histogram.


Appendix B. Training by gradient descent

Given a data set $D\,\subseteq {\rm{\Omega }}\times {{\mathbb{R}}}^{n}$, there are several ways to measure the performance of a neural network ${ \mathcal N }:{\rm{\Omega }}\to {{\mathbb{R}}}^{n}$ on D. For this purpose, we have already introduced the MAPE, which we evaluate on the test set D = T2 after training. Lower MAPEs are thereby associated with better performances. Such a function $L:{\rm{\Gamma }}\to {{\mathbb{R}}}^{+}$ that maps a neural network to a non-negative, real number is called a loss function. The weight space Γ is the configuration space of the used neural network architecture and as such it is spanned by all internal parameters (e.g. all weights and biases of an MLP). Therefore, we can understand all neural networks ${ \mathcal N }\in {\rm{\Gamma }}$ of the given architecture as points in weight space. The goal all training algorithms have in common is to find the global minimum of a given loss function in weight space. It is important to note that loss functions become highly non-convex for larger data sets and deeper and more sophisticated architectures. As a consequence, training usually reduces to finding a well performing local minimum.

A prominent family of training algorithms are gradient descent techniques. These are iterative, with each iteration corresponding to a step the network takes in weight space. The direction of the steepest loss descent at the current position ${ \mathcal N }\in {\rm{\Gamma }}$ is given by the negative gradient of $L({\boldsymbol{t}},{ \mathcal N }({\boldsymbol{U}}))$. Updating internal parameters along this direction is the name-giving feature of gradient descent techniques. This suggests the update rule$\begin{eqnarray}p\longleftarrow p-\eta \displaystyle \frac{\partial L}{\partial p}({\boldsymbol{t}},{ \mathcal N }({\boldsymbol{U}}))\end{eqnarray}$for each internal parameter p, where by the left arrow $a\,\leftarrow \,b$ we denote the assignment ‘a=b’ as used in computer programming. Accordingly, the entire training procedure corresponds to a path in weight space. The granularity of that path is controlled by the learning rate η: smaller learning rates cause a smoother path but a slower approach towards local minima and vice versa. In any case, training only for one epoch, that is scanning only once through the training set, usually does not suffice to arrive near any satisfactory minima. A typical training procedure consists of several epochs.
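As an illustration of the update rule (B1), a single stochastic gradient descent step for one training sample can be written with automatic differentiation; model, loss_fn and the sample (U, t) are placeholders for any PyTorch module, loss and data pair:

import torch

def sgd_step(model, loss_fn, U, t, eta=0.01):
    # One update p <- p - eta * dL/dp for every internal parameter p, equation (B1).
    loss = loss_fn(model(U), t)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad
            p.grad.zero_()
    return loss.item()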

Usually, the order of training samples $({\boldsymbol{U}},{\boldsymbol{t}})$ is randomized to achieve a faster learning progress and to make training more robust to badly performing local minima. Therefore, this technique is also called stochastic gradient descent. Important alternatives to mention are mini-batch gradient descent and batch gradient descent, where update steps are not taken with respect to the loss $L({\boldsymbol{t}},{ \mathcal N }({\boldsymbol{U}}))$ of a single sample, but to the batch loss$\begin{eqnarray}{L}_{D}({ \mathcal N })=\displaystyle \frac{1}{| D| }\displaystyle \sum _{({\boldsymbol{U}},{\boldsymbol{t}})\in D}L({\boldsymbol{t}},{ \mathcal N }({\boldsymbol{U}}))\end{eqnarray}$of randomly selected subsets D of the training set T1 with the batch size $| D| =B$ or the entire training set itself, respectively. There are more advanced gradient descent techniques like Adam and Adamax [26] that introduce a dependence on previous updates and adapted learning rates. It is particularly recommended to use these techniques when dealing with large amounts of data and high-dimensional weight spaces.

For training the members ${{ \mathcal F }}_{i}$ and ${{ \mathcal B }}_{i}$ of both ensembles ${ \mathcal F }$ and ${ \mathcal B }$, we apply the same training procedure using the machine learning framework provided by PyTorch [28]: weights and biases are initialized via the He-initialization [27]. We use the Adamax optimizer with the batch size B = 10 to minimize the L1-Loss,$\begin{eqnarray}{L}_{{\rm{L}}1}({\boldsymbol{t}},{ \mathcal N }({\boldsymbol{U}}))=\displaystyle \frac{1}{n}\parallel {\boldsymbol{t}}-{ \mathcal N }({\boldsymbol{U}})\parallel ,\end{eqnarray}$over 20 epochs. Here, we apply an exponentially decaying learning rate schedule ${\eta }_{t}=0.01\ \exp (-t/2)$. In this case, the decreasing learning rates allow a much closer and more stable approach towards local minima.
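A compact sketch of this training procedure is shown below. The hyperparameters (batch size 10, 20 epochs, L1 loss, Adamax, learning rates ηt=0.01 exp(−t/2)) follow the text; the explicit initialization call and all function names are ours, with kaiming_normal_ implementing the He-initialization of [27]:

import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def he_init(module):
    # He initialization [27] of the weights; biases are set to zero.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight)
        nn.init.zeros_(module.bias)

def train_member(model, U_train, f_train, epochs=20, batch_size=10):
    # Mini-batch training with Adamax, the L1 loss of equation (B3) and the
    # exponentially decaying learning rate schedule eta_t = 0.01 exp(-t/2).
    model.apply(he_init)
    loader = DataLoader(TensorDataset(U_train, f_train), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adamax(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=math.exp(-0.5))
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for U_batch, f_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(U_batch), f_batch)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model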

Acknowledgments

We thank Hans-Werner Hammer and Bernard Metsch for useful comments. We acknowledge partial financial support from the Deutsche Forschungsgemeinschaft (Project-ID 196253076 - TRR 110, ‘Symmetries and the Emergence of Structure in QCD’). Further support was provided by the Chinese Academy of Sciences (CAS) President’s International Fellowship Initiative (PIFI) (Grant No. 2018DM0034), by the EU (Strong2020) and by the VolkswagenStiftung (Grant No. 93562).


References

[1] Radovic A Williams M Rousseau D et al. 2018 Nature 560 41 DOI: 10.1038/s41586-018-0361-2
[2] Richards J W et al. 2011 Astrophys. J. 733 10 DOI: 10.1088/0004-637X/733/1/10
[3] Buckley A Shilton A White M J 2012 Comput. Phys. Commun. 183 960 DOI: 10.1016/j.cpc.2011.12.026
[4] Graff P Feroz F Hobson M P Lasenby A N 2014 Mon. Not. R. Astron. Soc. 441 1741 DOI: 10.1093/mnras/stu642
[5] Carleo G Troyer M 2017 Science 355 602 DOI: 10.1126/science.aag2302
[6] Mills K Spanner M Tamblyn I 2017 Phys. Rev. A 96 042113 DOI: 10.1103/PhysRevA.96.042113
[7] Wetzel S J Scherzer M 2017 Phys. Rev. B 96 184410 DOI: 10.1103/PhysRevB.96.184410
[8] He Y H 2017 Phys. Lett. B 774 564 DOI: 10.1016/j.physletb.2017.10.024
[9] Fujimoto Y Fukushima K Murase K 2018 Phys. Rev. D 98 023019 DOI: 10.1103/PhysRevD.98.023019
[10] Wu Y Zhang P Shen H Zhai H 2018 Phys. Rev. A 98 010701 DOI: 10.1103/PhysRevA.98.010701
[11] Niu Z M Liang H Z Sun B H Long W H Niu Y F 2019 Phys. Rev. C 99 064307 DOI: 10.1103/PhysRevC.99.064307
[12] Brehmer J Cranmer K Louppe G Pavez J 2018 Phys. Rev. Lett. 121 111801 DOI: 10.1103/PhysRevLett.121.111801
[13] Steinheimer J Pang L Zhou K Koch V Randrup J Stoecker H 2019 J. High Energy Phys. JHEP12(2019)122 DOI: 10.1007/JHEP12(2019)122
[14] Larkoski A J Moult I Nachman B 2020 Phys. Rep. 841 1 DOI: 10.1016/j.physrep.2019.11.001
[15] Baker G A 1999 Phys. Rev. C 60 054311 DOI: 10.1103/PhysRevC.60.054311
[16] Efimov V 1970 Phys. Lett. 33B 563 DOI: 10.1016/0370-2693(70)90349-7
[17] Heiselberg H 2002 Phys. Rev. A 63 043606 DOI: 10.1103/PhysRevA.63.043606
[18] Braaten E Hammer H-W 2006 Phys. Rep. 428 259 DOI: 10.1016/j.physrep.2006.03.001
[19] Bulgac A Drut J E Magierski P 2006 Phys. Rev. Lett. 96 090404 DOI: 10.1103/PhysRevLett.96.090404
[20] Lee D 2006 Phys. Rev. B 73 115112 DOI: 10.1103/PhysRevB.73.115112
[21] König S Grießhammer H W Hammer H W van Kolck U 2017 Phys. Rev. Lett. 118 202501 DOI: 10.1103/PhysRevLett.118.202501
[22] Kraemer T Mark M Waldburger P et al. 2006 Nature 440 315 DOI: 10.1038/nature04626
[23] Barron J T 2017 arXiv:1704.07483 [cs.LG]
[24] Endres M G Kaplan D B Lee J W Nicholson A N 2013 Phys. Rev. A 87 023615 DOI: 10.1103/PhysRevA.87.023615
[25] Jonsson B Eng S T 1990 IEEE J. Quantum Electron. 26 2025–2035 DOI: 10.1109/3.62122
[26] Kingma D P Ba J 2017 arXiv:1412.6980 [cs.LG]
[27] He K Zhang X Ren S Sun J 2015 IEEE Int. Conf. on Computer Vision (ICCV) 1026–1034
[28] Paszke A et al. 2019 Advances in Neural Information Processing Systems (NeurIPS) vol 32 8024–8035
