Course title: Deep Learning Theory, MICAS 913

Program: MICAS: Master in Machine Learning, Communications and Security

Volume: 33h (lectures 30h, final exam 3h), 1 ECTS credit

Instructor: Mansoor Yousefi

TA: Jamal Darweesh

Office hours: 3D55, bi-weekly, Fridays 17h–18h30, as well as on Zoom


This is a graduate course on deep learning theory, an important topic in machine learning. The course consists of three parts: approximation theory, optimization theory, and statistical theory.

The first part studies the approximation error rates of feed-forward neural networks (NNs). The second part analyzes gradient-based optimization algorithms, in particular the optimization error rates of stochastic gradient descent (SGD) and its variants. The third part studies generalization performance, proving bounds on the generalization error of NNs with i.i.d. data.
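The three parts mirror the standard decomposition of the excess risk. As a sketch (the notation here is assumed, not fixed by the course): with $f^\ast$ the Bayes-optimal predictor, $\hat f_n$ the empirical risk minimizer over the hypothesis class $\mathcal{F}$, and $\hat f$ the model actually returned by the training algorithm,

```latex
R(\hat f) - R(f^\ast)
  = \underbrace{R(\hat f) - R(\hat f_n)}_{\text{optimization error}}
  + \underbrace{R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}}
  + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R(f^\ast)}_{\text{approximation error}}
```

Parts I, II, and III of the course bound the approximation, optimization, and estimation terms, respectively.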

The course includes a semester simulation project in which students apply NNs to receiver design for data transmission over a nonlinear communication channel. The course is compiled from research papers on neural networks and deep learning theory. Students should read the reference papers before class.


Course outline

There are 10 lectures of 3h each (2×1.5h), and a final exam.

  • Part I: Approximation theory

    • Lecture 1: Decomposition of risk in ERM, universal approximation

    • Lecture 2: Approximation rates of shallow NNs with a broad class of activations

    • Lecture 3: Approximation rates of deep NNs with ReLU or piecewise-linear activation

  • Part II: Optimization theory

    • Lecture 4: Analysis of stochastic gradient descent, implementation of back-propagation

    • Lecture 5: Momentum, adaptive step size, accelerated algorithms

    • Lecture 6: Loss landscape, convergence of local search methods, no-spurious-local-minima results

  • Part III: Statistical theory

    • Lecture 7: Rademacher and VC generalization bounds for multi-layer NNs

    • Lecture 8: Neural tangent kernel (NTK), the role of over-parameterization

    • Lecture 9: Implicit bias of GD, linear models and infinitely wide 2-layer NNs, double-descent curve, benign over-fitting

    • Lecture 10: Architectures, with a focus on the transformer model

  • Final exam: Research project presentation

The program can be adjusted based on the pace of progress and students’ feedback.

Learning outcomes

The learning objectives of the course are as follows.

  • Understand the expressive power of NNs in approximating important functional classes

  • Demonstrate universal approximation as the number of neurons tends to infinity

  • Compute the approximation error rates of NNs with piecewise-linear activation and finitely many neurons

  • Analyze the convergence rates of gradient-based optimization algorithms

  • Explain momentum, adaptive step-size and accelerated gradient descent (GD)

  • Understand the loss landscape, no-spurious-local-minima results, the role of positive homogeneity, and the success of local search methods in non-convex NN optimization

  • Understand why deep learning generalization defies classical statistical learning theory, and derive bounds on the generalization error of NNs

  • Linearize an NN, and analyze the excess risk of the resulting neural tangent kernel model

  • Deduce the implicit bias of gradient descent in two models, derive the double-descent curve, and understand the role of over-parameterization in generalization (and in facilitating optimization)
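For reference, the kind of convergence-rate statement the optimization outcomes build toward can be stated compactly. A sketch under standard assumptions (the notation is assumed here, not taken from the lecture notes):

```latex
% SGD update with unbiased stochastic gradients:
\theta_{t+1} = \theta_t - \eta\, g_t,
\qquad \mathbb{E}[g_t \mid \theta_t] = \nabla L(\theta_t).

% For a convex, G-Lipschitz loss L, with \|\theta_1 - \theta^\ast\| \le D
% and constant step size \eta = D/(G\sqrt{T}), the averaged iterate satisfies
\mathbb{E}\big[L(\bar\theta_T)\big] - L(\theta^\ast)
  \le \frac{DG}{\sqrt{T}},
\qquad \bar\theta_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t.
```

The course refines this picture for the non-convex losses arising in NN training.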


Evaluation

The evaluation is based on a semester simulation project and a final exam.

  • Simulation project (50%). The project is on deep learning of nonlinear partial differential equations (PDEs). The class is partitioned into groups of two students. Each group designs an NN for equalization in data transmission over optical fiber, modeled by the stochastic nonlinear Schrödinger (NLS) equation. The project is introduced early in the class and completed gradually during the semester. A detailed guide outlines the project steps. Students submit a final report implementing the steps and answering the questions in the guide. A quantitative grading scheme is provided. The instructor holds bi-weekly office hours to review students' progress and provide feedback.

  • Final exam (50%). This is a 3-hour written exam, with questions from the three parts of the class.
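To fix ideas, the project's equalization task can be sketched in miniature. The following is a hedged toy example, not the project itself: it replaces the stochastic NLS fiber model with a simple memoryless cubic nonlinearity plus noise (an assumption made here for brevity), and trains a one-hidden-layer NumPy MLP equalizer with plain SGD. The actual project follows the detailed guide distributed in class.

```python
# Toy sketch of an NN equalizer (NOT the project's NLS channel model).
import numpy as np

rng = np.random.default_rng(0)

def channel(x, noise_std=0.1):
    # Stand-in nonlinear channel: cubic distortion plus Gaussian noise.
    return x + 0.3 * x**3 + noise_std * rng.standard_normal(x.shape)

# BPSK training data: transmitted symbols x, received samples y.
x = rng.choice([-1.0, 1.0], size=2000)
y = channel(x)

# One-hidden-layer MLP equalizer y -> x_hat, trained with plain SGD.
W1 = 0.5 * rng.standard_normal((16, 1)); b1 = np.zeros(16)
W2 = 0.5 * rng.standard_normal(16);      b2 = 0.0
lr = 0.05
for epoch in range(50):
    for yi, xi in zip(y, x):
        h = np.tanh(W1[:, 0] * yi + b1)      # hidden activations
        out = W2 @ h + b2                    # equalizer output
        err = out - xi                       # squared-loss gradient seed
        gW2 = err * h; gb2 = err             # output-layer gradients
        gh = err * W2 * (1 - h**2)           # back-prop through tanh
        W2 -= lr * gW2; b2 -= lr * gb2
        W1[:, 0] -= lr * gh * yi; b1 -= lr * gh

# Hard decisions: sign of the equalizer output.
x_hat = np.array([np.sign(W2 @ np.tanh(W1[:, 0] * yi + b1) + b2) for yi in y])
ber = np.mean(x_hat != x)
print(f"bit error rate: {ber:.3f}")
```

The project replaces this toy channel with split-step simulation of the stochastic NLS equation and a more realistic receiver chain, as detailed in the guide.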


Prerequisites

  • MICAS 911: Statistical machine learning

  • MICAS 901: Introduction to optimization

  • Programming in Julia or Python for the project


References

  • The primary references are research papers that will be provided before each lecture

  • M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999