LargeDATA Chair

Description

Motivation and Key Objectives:
At the core of AI lies a set of elaborate non-linear, data-driven or implicitly-defined machine learning methods and algorithms. These, however, largely rely on “small-dimensional intuitions” and heuristics which are mostly inappropriate and behave strikingly differently in large dimensions. This “curse” of dimensionality notably explains the failure of kernel methods and the difficulty of fathoming powerful deep networks. Simultaneously, the strong pressure and demand for reliable AI tools from academia and companies alike creates an unprecedented need for novel mathematical tools and theoretical guarantees for machine learning algorithms.
Recent advances in large-dimensional statistics, particularly in random matrix theory and statistical physics, have provided important clues and striking first results on the understanding and improvement of machine learning methods in large dimensions: a new approach to kernel methods is emerging, a renewed methodology for semi-supervised learning has been devised, and first advances in the difficult theory of deep neural networks have appeared. More surprisingly, by exploiting a notion of universality, statistical predictions from random matrix theory are increasingly shown to be robust to real datasets and thus to be powerful practical performance predictors. Many of these recent findings were initiated by members of the present project.
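To make the “curse” above concrete, the following minimal sketch (our own illustration of the well-known distance-concentration phenomenon, not code from the chair) shows that pairwise distances between independent high-dimensional vectors all concentrate around a single value, so the entries of a Gaussian kernel matrix become nearly indistinguishable off the diagonal:

```python
# Minimal numerical sketch (our illustration, not the chair's own code):
# pairwise squared distances between independent p-dimensional vectors
# concentrate around a single value as p grows, so off-diagonal Gaussian
# kernel entries become nearly identical -- the phenomenon behind the
# large-dimensional behaviour of kernel methods.
import numpy as np

rng = np.random.default_rng(0)
n = 200
for p in (5, 100, 10_000):
    X = rng.standard_normal((n, p)) / np.sqrt(p)   # rows ~ N(0, I_p / p)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # all pairwise squared distances
    off = d2[~np.eye(n, dtype=bool)]               # off-diagonal entries only
    print(f"p={p:6d}  mean ||x_i - x_j||^2 = {off.mean():.3f}  std = {off.std():.3f}")
# The fluctuations shrink like 1/sqrt(p): all points look equidistant, and
# small-dimensional intuition about kernel similarities no longer applies.
```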

Chair Description and Program:
The LargeDATA chair is anchored on these recent breakthroughs and proposes to develop a consistent mathematical framework for the large-dimensional analysis, improvement and renewed design of basic-to-advanced data processing methods. LargeDATA is a theoretical chair, organized around (i) two methodological work packages on the development and application of random matrix theory and statistical physics to machine learning for large datasets, and (ii) a “transfer” work package that elaborates both on more realistic data models and on applications to real data (see details below).
The long-term ambition is for the chair to hold a leading core-AI position within the MIAI institute, providing world-class mathematical advances and attracting talent in the automated treatment of large, numerous and dynamic data.

Activities

WP1. Random Matrix Theory (RMT) for AI. This work package analyses non-linear matrix models (kernel random matrices, random graphs, random neural network models, large random tensors) [Comon,Couillet,Tremblay] and their implications for related machine learning algorithms (LS-SVM, SSL, spectral clustering, neural nets, DPP sampling) [Amblard,Barthelme,Couillet]. It also studies the performance of implicit solutions to large-dimensional optimization problems for classification and regression (SVM, logistic regression, GLMM) [Chatelain,Couillet].
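As an illustration of the kind of object studied in WP1, the hedged sketch below (our own toy setup, not the chair's code) forms the Gaussian kernel matrix of a two-class high-dimensional Gaussian mixture and recovers the classes from one isolated eigenvector, the mechanism exploited by kernel spectral clustering in the large-dimensional regime:

```python
# Illustrative toy example for WP1 (assumptions and parameters are ours):
# Gaussian kernel spectral clustering of a two-class high-dimensional
# Gaussian mixture.  With class means separated at scale O(1) while p and n
# grow together, the class structure appears in an isolated eigenvector of
# the kernel matrix.
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 400
y = np.repeat([-1, 1], n // 2)                     # ground-truth classes
mu = np.zeros(p); mu[0] = 1.0                      # class-mean direction, ||mu|| = 1
X = rng.standard_normal((n, p)) / np.sqrt(p) + np.outer(y, mu) / 2

sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T       # pairwise squared distances
K = np.exp(-d2)                                    # Gaussian kernel, unit bandwidth

vals, vecs = np.linalg.eigh(K)                     # ascending eigenvalues
v = vecs[:, -2]                                    # second-largest: class-structure direction
labels = np.sign(v)
accuracy = max(np.mean(labels == y), np.mean(labels == -y))
print(f"clustering accuracy from a single eigenvector: {accuracy:.2f}")
```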

WP2. Statistical Physics and Graphs in AI. This work package explores statistical physics (and related heuristics) in scenarios where rigorous mathematical developments and techniques are difficult or currently inaccessible [Barthelme,Couillet,Tremblay]. It specifically concerns statistical models of sparse random matrices and graphs, and complex non-linear learning methods such as deep neural networks.
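A representative sparse-graph tool in this line of work is the Bethe-Hessian, revisited in the chair's publications listed below. The following hedged sketch (the standard construction, with graph and parameter choices of our own) applies it to community detection on a sparse two-community stochastic block model:

```python
# Hedged sketch for WP2 (standard Bethe-Hessian recipe; graph and parameter
# choices are ours, not the chair's): community detection on a sparse
# two-community stochastic block model.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = np.repeat([-1, 1], n // 2)                     # planted communities
c_in, c_out = 8.0, 2.0                             # within/between expected degrees
P = np.where(np.equal.outer(y, y), c_in / n, c_out / n)
A = np.triu(rng.random((n, n)) < P, k=1)
A = (A | A.T).astype(float)                        # sparse symmetric adjacency matrix

d = A.sum(axis=1)                                  # degrees
r = np.sqrt(d.mean())                              # classical choice r = sqrt(average degree)
H = (r**2 - 1) * np.eye(n) + np.diag(d) - r * A    # Bethe-Hessian H(r)

vals, vecs = np.linalg.eigh(H)                     # informative eigenvalues are the negative ones
labels = np.sign(vecs[:, 1])                       # eigenvector of the second-smallest eigenvalue
overlap = max(np.mean(labels == y), np.mean(labels == -y))
print(f"fraction of correctly recovered community labels: {overlap:.2f}")
```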

WP3. Universality Results: from Theory to Practice. This work package develops the theoretical grounds for applying large-dimensional statistics to more practical considerations (such as brain signal processing [Barthelme]). The focus will be on concentration-of-measure theory for neural network analysis [Couillet], sparse matrix and network analyses for clustering and graph mining [Couillet,Tremblay], as well as on heuristic methods to better grasp the most challenging real data and algorithm models (e.g., deep learning methods) [Couillet,Tremblay].
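The universality idea motivating WP3 can be illustrated numerically; the sketch below (our illustration in the classical Marchenko-Pastur setting rather than the chair's specific data models) shows that sample-covariance spectra are essentially insensitive to the entry distribution once means and variances match:

```python
# Small numerical sketch of the universality phenomenon underlying WP3
# (our illustration in the classical Marchenko-Pastur setting, not the
# chair's models): the eigenvalue spectrum of a sample covariance matrix is
# essentially independent of the entry distribution once mean and variance match.
import numpy as np

rng = np.random.default_rng(3)
p, n = 500, 1500

def covariance_spectrum(X):
    """Eigenvalues of the sample covariance (1/n) X X^T for p x n data X."""
    return np.linalg.eigvalsh(X @ X.T / n)

samples = {
    "Gaussian":   rng.standard_normal((p, n)),
    "Rademacher": rng.choice([-1.0, 1.0], size=(p, n)),
    "uniform":    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(p, n)),  # variance 1
}
for name, X in samples.items():
    s = covariance_spectrum(X)
    print(f"{name:10s}  smallest={s.min():.3f}  largest={s.max():.3f}  mean={s.mean():.3f}")
# All three spectra fill the same Marchenko-Pastur bulk
# [(1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2], regardless of the entry law -- the
# kind of universality that makes random matrix predictions robust on real data.
```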

Chair events

The chair organizes regular meetings with its main industrial collaborators HUAWEI Labs Paris, CEA Leti/List, and STMicroelectronics.

A merged seminar of the GAIA and LargeDATA chairs takes place on a weekly basis at GIPSA-lab.

Scientific publications

2019

Estimation of Covariance Matrix Distances in the High Dimension Low Sample Size Regime hal-02965834

Revisiting the Bethe-Hessian: Improved Community Detection in Sparse Heterogeneous Graphs hal-02429525

Random matrix-improved estimation of covariance matrix distances hal-02355223

Random Matrix-Improved Estimation of the Wasserstein Distance between two Centered Gaussian Distributions hal-02965778

Classification spectrale par la laplacienne déformée dans des graphes réalistes hal-02153901

Random Matrix Improved Covariance Estimation for a Large Class of Metrics hal-02152121

A Large Scale Analysis of Logistic Regression: Asymptotic Performance and New Insights hal-02139980

Community Detection in Sparse Realistic Graphs: Improving the Bethe Hessian hal-02153916

Improved Estimation of the Distance between Covariance Matrices hal-02355321

Revisiting and improving semi-supervised learning: a large dimensional approach hal-02139979

A Large Dimensional Analysis of Least Squares Support Vector Machines hal-02048984

Concentration of Measure and Large Random Matrices with an application to Sample Covariance Matrices hal-02020287

A Kernel Random matrix-based approach for sparse PCA hal-02971198

Kernel Random Matrices of large concentrated data: The example of GAN-Generated Images hal-02971224

Pourquoi les matrices aléatoires expliquent l'apprentissage ? Un argument d'universalité offert par les GANs hal-02971207

2020

A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent hal-02971807

Concentration of solutions to random equations with concentration of measure hypotheses hal-02973851

Performance-complexity trade-off in large dimensional statistics hal-02961057

Smoothing graph signals via random spanning forests hal-02956603

Large Dimensional Asymptotics of Multi-Task Learning hal-02965810

A Random Matrix Analysis of Learning with α-Dropout hal-02971211

Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures hal-02971185

Smoothing graph signals via random spanning forests hal-02319175