John Wilmes, PhD
Assistant Professor
Department of Mathematics, Brandeis University
October 1, 2018
Guarantees for Training Neural Networks
Before AI can take over the world, its neural networks must be trained. While at its most basic level a neural network maps an input to an output, deep learning can train these networks to classify, correlate, and even predict outcomes. Within these networks, variables are assigned different weights (levels of importance), which are adjusted during training to decrease error rates. However, noise in the system can confuse the network, leading to an incorrect output (which could be disastrous for an AI such as a self-driving car!). Dr. Wilmes discussed his research on training neural networks using deep learning, on reducing error rates, and on the search for guarantees for this type of learning.
Deep learning has brought about stunning advances in practical artificial intelligence over the past decade, in applications as diverse as autonomous vehicles, speech recognition, and recommendation systems. Despite this impressive empirical success, deep learning algorithms lack a robust theoretical foundation. Provable guarantees for deep learning, that is, for training deep neural networks using variants of gradient descent, have been elusive.
In joint work with Santosh Vempala, we study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted-sum layer. We obtain essentially matching upper and lower bounds. We begin with the observation that the concepts computed by simple neural networks are well approximated by low-degree polynomials when inputs are drawn uniformly from the sphere, so regression problems for data labelled by an unknown neural network can be captured by agnostic polynomial learning. We give an agnostic learning guarantee for training neural networks via gradient descent, establishing convergence to within error ε of the optimum when the numbers of gates, samples, and iterations are all bounded by n^{O(k)} · log(1/ε) for concepts approximated by degree-k polynomials. On the other hand, any statistical query algorithm (including gradient descent in its many variations) requires n^{Ω(k)} queries for the same problem.
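To make the setting concrete, here is a minimal sketch of this kind of training problem: only the output weights of a one-hidden-layer network are trained by gradient descent, with inputs drawn uniformly from the unit sphere. This illustrates the model class, not the paper's algorithm; the dimensions, activation, learning rate, and iteration count are assumptions chosen for the toy example.

    # Toy sketch: gradient descent on the output weights of a one-hidden-layer
    # network (nonlinear activations, weighted-sum output). All sizes and
    # hyperparameters are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, samples = 10, 50, 2000      # input dimension, hidden gates, sample count

    def unit_sphere(num, dim):
        """Draw points uniformly from the unit sphere in R^dim."""
        x = rng.standard_normal((num, dim))
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    X = unit_sphere(samples, n)

    # Labels come from an unknown "teacher" network of the same form.
    W_true = rng.standard_normal((m, n))
    a_true = rng.standard_normal(m)
    y = np.tanh(X @ W_true.T) @ a_true

    # Student: the hidden layer is fixed at random and only the output
    # weights a are trained, so the squared loss is convex in a.
    W = rng.standard_normal((m, n))
    H = np.tanh(X @ W.T)              # hidden activations, shape (samples, m)
    a = np.zeros(m)

    lr = 0.1
    for step in range(500):
        residual = H @ a - y
        grad = H.T @ residual / samples   # gradient of half the mean squared error
        a -= lr * grad

    print("final mean squared error:", np.mean((H @ a - y) ** 2))

Because the hidden layer is fixed, gradient descent here converges to the best least-squares fit over the span of the hidden activations, which is the flavor of guarantee described above: convergence to within ε of the optimal error for the model class.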
We also re-evaluate the success of deep learning in view of the ubiquitous phenomenon of adversarial inputs to deep learning models, first observed by Szegedy et al. in 2013. Although human vision appears to be quite robust to small amounts of L∞-noise, most deep learning models for image classification can be fooled by modest perturbations of correctly classified inputs. It remains an open question precisely how this susceptibility to noise arises from the training algorithm, and to what extent it can be avoided. We discuss future directions for formal guarantees against adversarial inputs.
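As a concrete illustration of such a perturbation, the sketch below applies the fast gradient sign method of Goodfellow et al. (2014) to a toy classifier: the input is shifted by ε coordinate-wise in the direction that increases the loss, staying within an L∞ ball. The random linear model and all parameters are assumptions for illustration; real attacks target trained deep networks.

    # Toy L∞ adversarial perturbation (fast gradient sign method style).
    # The random linear classifier and all parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(1)
    d, classes = 784, 10               # e.g., a flattened 28x28 image, 10 labels
    W = 0.1 * rng.standard_normal((classes, d))
    x = rng.uniform(0.0, 1.0, d)       # a stand-in "correctly classified" input
    label = int(np.argmax(W @ x))      # the model's current prediction

    def input_gradient(x, label):
        """Gradient of softmax cross-entropy loss with respect to the input."""
        logits = W @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[label] -= 1.0                # gradient of the loss in logit space
        return W.T @ p                 # chain rule back through the linear layer

    eps = 0.1                          # L∞ budget for the perturbation
    x_adv = np.clip(x + eps * np.sign(input_gradient(x, label)), 0.0, 1.0)

    print("original prediction: ", label)
    print("perturbed prediction:", int(np.argmax(W @ x_adv)))
    print("L∞ distortion:       ", np.abs(x_adv - x).max())

Even for this toy model the perturbed prediction often differs from the original while the L∞ distortion stays small; for trained image classifiers the same effect occurs at distortions well below what human vision notices.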