Gradient Descent with Periodic Entrainment on the Loss Surface of Neural Networks:
Consider the
entrained system with initial condition \(y_0\) (see
[1]):
\begin{equation}\label{eq:GDE0}
\dot{y}(t)=g(y(t))+\gamma(t)\varphi(t)
\end{equation}
with
\(g:\mathbb{R}^p\rightarrow\mathbb{R}^p\),
\(y(0)=y_0\in\mathbb{R}^p\) and
\(\gamma(t)=\alpha\beta(t)\) where
- \(\alpha\in\mathbb{R}_{\geq 0}\) is the amplitude factor,
- \(\beta:\mathbb{R}_{\geq 0}\rightarrow [0,1]\) is the slow periodic function with period \(U\),
- \(\varphi:\mathbb{R}_{\geq 0}\rightarrow [-1,1]^p\) is the fast periodic function with period \(T\).
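For intuition, the entrained system \eqref{eq:GDE0} can be integrated with a forward-Euler scheme. The concrete choices below (the drift \(g\), the amplitude \(\alpha\), and the two periodic waveforms) are illustrative placeholders, not taken from the text:

```python
import numpy as np

# Forward-Euler sketch of  y'(t) = g(y(t)) + gamma(t) * phi(t),
# with gamma(t) = alpha * beta(t).  All concrete choices (g, alpha,
# the envelope beta, the fast signal phi) are illustrative assumptions.
p = 2                                  # state dimension

def g(y):
    return -y                          # toy drift: gradient flow of ||y||^2 / 2

def beta(t, U=10.0):
    # slow periodic envelope in [0, 1] with period U
    return 0.5 * (1 - np.cos(2 * np.pi * t / U))

def phi(t, T=4.0):
    # fast periodic signal in [-1, 1]^p with period T (same value per component)
    return np.cos(2 * np.pi * t / T) * np.ones(p)

def euler(y0, alpha=0.05, h=0.01, steps=5000):
    y = np.asarray(y0, dtype=float)
    for k in range(steps):
        t = k * h
        y = y + h * (g(y) + alpha * beta(t) * phi(t))
    return y
```

With this contracting drift, the trajectory decays toward the origin and then oscillates inside a band of width on the order of \(\alpha\).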
We now consider a neural network (NN) in which the second layer is fixed and GD is applied to the weight matrix of the first layer (see
[2, 3]):
\begin{equation}\label{eqn:layer2}
\tilde{w}_{k+1}=\tilde{w}_k-h\,
\frac{\partial {\cal L}(\tilde{w}_k)}{\partial \tilde{w}_k}
+h\,\gamma(hk)\,\varphi(hk)
\end{equation}
where the gradient formula for each weight vector is
\begin{equation*}
\frac{\partial {\cal L}(w)}{\partial w^r}=\frac{1}{\sqrt{m}}\sum_{i=1}^n v^i a_r x_i\, \mathbb{I}\{x_i^\top w^r\geq 0\},
\end{equation*}
with \(r\in \{1,\dots,m\}\), where \(\mathbb{I}\) denotes the indicator function.
\(v(t)=(v^1(t),\dots,v^n(t))\in\mathbb{R}^{n}\) is the
error vector with components
\begin{equation*}
v^i(t)=f(w(t),a,x_i)-y_i\in\mathbb{R},
\end{equation*}
where \(f\) is a two-layer NN of the form:
\begin{equation}
f(w,a,x)=\frac{1}{\sqrt{m}}\sum_{r=1}^ma_r \sigma(x^\top w^r)
\end{equation}
Here \(x\in\mathbb{R}^d\) is the input,
\(w=(w^1,\dots,w^m)\in\mathbb{R}^{d\times m}\) is the weight matrix
of the first layer, \(a_r\in\mathbb{R}\) is the \(r\)-th output weight, and \(\sigma(\cdot)\) is the ReLU activation function:
\(\sigma(z)=z\) if \(z\geq 0\) and \(\sigma(z)=0\) otherwise.
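The perturbed update \eqref{eqn:layer2} can be sketched as follows for this two-layer ReLU network. The sizes mirror Example 1 below, but the random data, the seed, the choice \(a_r\in\{-1,+1\}\), the unit-norm inputs, and the squared loss \({\cal L}(w)=\frac{1}{2}\sum_i (v^i)^2\) are illustrative assumptions consistent with the gradient formula above:

```python
import numpy as np

# Sketch of the perturbed GD update on the first-layer weights only.
rng = np.random.default_rng(0)
n, m, d = 10, 20, 3
h, alpha = 0.15, 0.01
T = 4 * np.pi                 # period of phi(t) = cos(t/2)
U = 2.5 * T                   # period of the triangular envelope beta

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs, as in [3]
y = rng.standard_normal(n)                     # scalar targets y_i (assumed)
a = rng.choice([-1.0, 1.0], size=m)            # fixed second-layer weights a_r
W = rng.standard_normal((m, d))                # first-layer weights w^r (rows)

def f(W, X):
    # f(w, a, x) = (1/sqrt(m)) * sum_r a_r * relu(x^T w^r)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def grad(W, X, y):
    # dL/dw^r = (1/sqrt(m)) * sum_i v^i a_r x_i * I{x_i^T w^r >= 0}
    v = f(W, X) - y                          # errors v^i
    act = (X @ W.T >= 0.0).astype(float)     # indicator, shape (n, m)
    return a[:, None] * ((act * v[:, None]).T @ X) / np.sqrt(m)

def beta(t):
    # triangular envelope in [0, 1] with period U (an assumed waveform)
    s = (t / U) % 1.0
    return 2 * s if s < 0.5 else 2 * (1 - s)

loss0 = 0.5 * np.sum((f(W, X) - y) ** 2)
for k in range(300):
    tk = h * k
    # phi(t) = cos(t/2), applied identically to every weight component
    W = W - h * grad(W, X, y) + h * alpha * beta(tk) * np.cos(tk / 2)
loss = 0.5 * np.sum((f(W, X) - y) ** 2)
```

Note that only \(W\) is updated; the output weights \(a_r\) stay fixed, as stated above.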
Numerical Example 1:
The training is conducted with \(n=10\) data points, \(m=20\) hidden nodes, and input dimension \(d=3\).
We take \(h=0.15\), \(\varphi(t)=\cos(\frac{t}{2})\) (hence \(T=4\pi\)) and \(\alpha=0\), i.e., plain GD without entrainment.
The initial conditions and the initial first-layer weight vectors \(w^r\) are drawn at random.
Results:
Numerical Example 2:
Now we consider the same example but with \(\alpha=0.01\) and \(\beta(t)\in [0,1]\) is a triangular signal of period \(U=2.5T=10\pi\).
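One way to realize such a triangular envelope \(\beta\) is sketched below; the exact waveform used in the experiments is not specified, so this rise-then-fall shape is an assumption:

```python
import math

def beta(t, U=10 * math.pi):
    # Triangular wave in [0, 1] with period U: rises linearly on [0, U/2],
    # then falls linearly back to 0 on [U/2, U].
    s = (t / U) % 1.0
    return 2.0 * s if s < 0.5 else 2.0 * (1.0 - s)
```

The peak value 1 is reached at \(t=U/2=5\pi\), and the signal repeats with period \(U=10\pi\).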
Results:
Numerical Example 3:
Now we consider the same example but with \(\alpha=0.05\) and \(\beta(t)\in [0,1]\) is a triangular signal of period \(U=2.5T=10\pi\).
Results:
Numerical Example 4:
Now we consider the same example but with \(\alpha=0.07\) and \(\beta(t)\in [0,1]\) is a triangular signal of period \(U=2.5T=10\pi\).
Results:
Numerical Example 5:
We consider the same example but with different initial conditions, \(\alpha=0.01\), and \(\beta(t)\in [0,1]\) a triangular signal of period \(U=2.5T=10\pi\).
Results:
References:
[1] J. Jerray, A. Saoud, L. Fribourg, "Using Euler's Method to Prove the Convergence of Neural Networks," IEEE Control Systems Letters, vol. 6, 2022.
[2] J. Jerray, A. Saoud, L. Fribourg, "Asymptotic error in Euler's method with a constant step size," European Journal of Control, vol. 68, 2022.
[3] S. S. Du, X. Zhai, B. Poczos, A. Singh, "Gradient descent provably optimizes over-parameterized neural networks," CoRR, vol. abs/1810.02054, 2018.