In this talk, I present an information-theoretic view of data science, organized around the following fundamental questions: What is the minimal amount of information necessary to solve a given inference problem? Once that minimal amount of information is available, is it possible to design a low-complexity algorithm that solves the problem? What are the trade-offs among the parameters at play (e.g., the dimensionality of the problem, the size of the data sample, and the computational complexity)?
As a case study, I consider the inference problem of fitting data with a linear combination of a large number of simple components. This idea lies at the core of a variety of methods, most notably two-layer neural networks, but also kernel regression, sparse deconvolution, and boosting. More specifically, we want to learn a concave function by fitting the centers of bump-like neurons. The resulting optimization problem is highly non-convex, and little is known about the convergence properties of gradient descent methods in this setting. Our main result shows that stochastic gradient descent (SGD) converges to the global optimum at an exponential rate. As a by-product of our analysis, we accurately track the dynamics of SGD for a wide class of two-layer neural networks. To do so, we prove that, as the number of neurons grows large, the evolution of SGD converges to a gradient flow in the space of probability distributions. Furthermore, as the bump width vanishes, this gradient flow solves a viscous porous medium equation. The associated risk function enjoys a special property, known as displacement convexity, and recent breakthroughs in optimal transport theory guarantee exponential convergence for this limiting dynamics. Numerical simulations demonstrate that the asymptotic theory accurately captures the behavior of SGD for moderate values of the parameters.
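To make the scaling limit concrete, here is a minimal sketch in illustrative notation (the symbols below are my own and are not taken verbatim from the talk). Write the network output as an average of N bumps K_delta of width delta centered at the particles w_i, and the population risk as the squared error against the target f:
\[
  \hat f_N(x) \;=\; \frac{1}{N}\sum_{i=1}^{N} K_\delta(x - w_i),
  \qquad
  R_N(w_1,\dots,w_N) \;=\; \mathbb{E}\Bigl[\bigl(f(X) - \hat f_N(X)\bigr)^2\Bigr].
\]
As N grows large, the empirical measure of the centers, \(\rho_t = \frac{1}{N}\sum_{i=1}^{N}\delta_{w_i(t)}\), evolves under the Wasserstein gradient flow of a limiting risk \(R(\rho)\). As the bump width vanishes, one may take as an illustration \(R(\rho) = \int (f - \rho)^2\), whose first variation is \(2(\rho - f)\), so that (up to constants, and with a diffusion term \(\tau \Delta \rho_t\) modeling noise in the updates) the flow reads
\[
  \partial_t \rho_t \;=\; \nabla\cdot\bigl(\rho_t\,\nabla\bigl(2(\rho_t - f)\bigr)\bigr) + \tau\,\Delta\rho_t
  \;=\; \Delta\bigl(\rho_t^{2}\bigr) - 2\,\nabla\cdot(\rho_t\,\nabla f) + \tau\,\Delta\rho_t,
\]
a viscous porous medium equation with drift. In this sketch, displacement convexity of \(R(\rho)\) for concave f is what delivers the exponential convergence rate for the limiting dynamics.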