Large Dimensional Statistics for AI

DESCRIPTION

Motivation and Key Objectives:
At the core of AI lies a set of elaborate non-linear, data-driven or implicitly-defined machine learning methods and algorithms. These, however, largely rely on “small-dimensional intuitions” and heuristics that are mostly inappropriate in large dimensions, where the methods behave strikingly differently. This “curse of dimensionality” notably explains the failure of kernel methods and the difficulty of fathoming the powerful deep networks. Simultaneously, the strong pressure and demand for reliable AI tools from academia and industry alike induce an unprecedented need for novel mathematical tools and theoretical guarantees for machine learning algorithms.
Recent advances in large-dimensional statistics, particularly in random matrix theory and statistical physics, have provided important clues and striking first results on the understanding, and directions of improvement, of machine learning methods in large dimensions: a new approach to kernel methods is emerging, a renewed methodology for semi-supervised learning is born, and first advances in the difficult theory of deep neural networks have appeared. More surprisingly, by exploiting a notion of universality, predictions from random matrix theory are increasingly shown to remain valid on real datasets, making them powerful practical performance predictors. Many of these recent findings originate with members of the present project.

Chair Description and Program:
The LargeDATA chair is anchored in these recent breakthroughs and proposes to develop a consistent mathematical framework for the large-dimensional analysis, improvement and renewed design of basic-to-advanced data processing methods. LargeDATA is a theoretical chair, organized around (i) two methodological work packages on the development and application of random matrix theory and statistical physics to machine learning for large datasets, and (ii) a “transfer” work package that elaborates both on more realistic data models and on applications to real data (see details below).
The long-term ambition is for the chair to hold a leading core-AI position within the MIAI institute, providing world-class mathematical advances and international attractiveness in the automated treatment of large, numerous and dynamic data.

ACTIVITIES

WP1. Random Matrix Theory (RMT) for AI. This workpackage analyses non-linear matrix models (kernel random matrices, random graphs, random neural network models, large random tensors) [Comon,Couillet,Tremblay] and their implications for related machine learning algorithms (LS-SVM, SSL, spectral clustering, neural nets, DPP sampling) [Amblard,Barthelme,Couillet]. It also studies the performance of implicit solutions to large-dimensional optimization problems for classification and regression (SVM, logistic regression, GLMM) [Chatelain,Couillet].
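As a minimal illustration of the RMT phenomena at play (a sketch, not the chair's own code; the dimensions and the identity-covariance data model are illustrative assumptions), the following Python snippet compares the empirical eigenvalue distribution of a sample covariance matrix with its deterministic large-dimensional prediction, the Marchenko-Pastur law:

    import numpy as np
    import matplotlib.pyplot as plt

    n, p = 4000, 1000                       # samples, dimension; ratio c = p/n fixed
    c = p / n
    rng = np.random.default_rng(0)

    X = rng.standard_normal((p, n))             # i.i.d. entries, identity covariance
    eigvals = np.linalg.eigvalsh(X @ X.T / n)   # sample covariance spectrum

    # Marchenko-Pastur density on its support [(1-sqrt(c))^2, (1+sqrt(c))^2]
    lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    t = np.linspace(lm, lp, 500)
    density = np.sqrt(np.maximum((lp - t) * (t - lm), 0)) / (2 * np.pi * c * t)

    plt.hist(eigvals, bins=60, density=True, alpha=0.5, label="empirical")
    plt.plot(t, density, "r", label="Marchenko-Pastur")
    plt.xlabel("eigenvalue"); plt.legend(); plt.show()

The close agreement between histogram and curve, already excellent at p = 1000, is what makes such deterministic spectral predictions usable as practical performance guarantees.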

WP2. Statistical Physics and Graphs in AI. This workpackage explores statistical physics approaches (and related heuristics) in scenarios where rigorous mathematical developments and techniques are difficult or currently inaccessible [Barthelme,Couillet,Tremblay]. This specifically concerns statistical models of sparse random matrices and graphs, and complex non-linear learning methods such as deep neural networks.
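As a hedged sketch of the kind of statistical-physics-inspired method under study in WP2, the snippet below runs Bethe-Hessian spectral community detection (Saade et al., 2014) on a sparse two-class stochastic block model graph; the graph size and the intra/inter-class connection rates are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000                                   # nodes, two planted communities
    labels = np.repeat([0, 1], n // 2)
    cin, cout = 9.0, 2.0                       # average intra/inter-class degrees
    P = np.where(labels[:, None] == labels[None, :], cin / n, cout / n)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1); A = A + A.T             # symmetric adjacency, no self-loops

    d = A.sum(axis=1)
    r = np.sqrt(d.mean())                      # standard heuristic: r = sqrt(mean degree)
    H = (r**2 - 1) * np.eye(n) - r * A + np.diag(d)   # Bethe-Hessian H(r)

    vals, vecs = np.linalg.eigh(H)             # eigenvalues in ascending order
    estimate = (vecs[:, 1] > 0).astype(int)    # sign of the 2nd smallest eigenvector
    accuracy = max(np.mean(estimate == labels), np.mean(estimate != labels))
    print(f"clustering accuracy: {accuracy:.2f}")

The choice r = sqrt(mean degree) is the original heuristic; refined, provably better choices of r in sparse regimes are precisely the object of the chair's publications on the topic (e.g., the Dall'Amico et al. papers listed below).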

WP3. Universality Results: from Theory to Practice. This workpackage develops the theoretical grounds for applying large-dimensional statistics to more practical settings (such as brain signal processing [Barthelme]). The focus will be on concentration-of-measure theory for neural network analysis [Couillet], sparse matrix and network analyses for clustering and graph mining [Couillet,Tremblay], as well as heuristic methods to better grasp the most challenging real-data and algorithm models (e.g., deep learning methods) [Couillet,Tremblay].
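As a toy illustration of the universality phenomenon central to WP3 (again a sketch under illustrative assumptions, not the chair's code), the Gram-matrix spectrum of non-Gaussian Rademacher data closely matches that of Gaussian data with the same first two moments:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 3000, 1000

    X_rad = 2.0 * (rng.random((p, n)) < 0.5) - 1   # Rademacher entries (+1/-1)
    X_gau = rng.standard_normal((p, n))            # Gaussian, same mean and variance

    s_rad = np.linalg.eigvalsh(X_rad @ X_rad.T / n)
    s_gau = np.linalg.eigvalsh(X_gau @ X_gau.T / n)

    # Universality: both spectra follow the same Marchenko-Pastur limit
    for q in (0.1, 0.5, 0.9):
        print(f"quantile {q}: Rademacher {np.quantile(s_rad, q):.3f}, "
              f"Gaussian {np.quantile(s_gau, q):.3f}")

The matched spectra illustrate why RMT predictions, derived under Gaussian assumptions, often remain accurate on real data: this is the universality exploited in WP3.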

CHAIR EVENTS

The chair organizes regular meetings with its main industrial collaborators: HUAWEI Labs Paris, CEA Leti/List, and STMicroelectronics.

A merged GAIA and LargeDATA chair seminar takes place on a weekly basis at GIPSA-lab.

SELECTED LIST OF PUBLICATIONS 

  • R. Couillet, F. Chatelain, N. Le Bihan, "Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clustering", International Conference on Machine Learning (ICML’21), virtual conference, 2021. [article|notebook]

  • Ch. Séjourné, R. Couillet, P. Comon, "A large-dimensional analysis of symmetric SNE", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21), Toronto, Canada, 2021. [article]

  • M. Seddik, C. Louart, R. Couillet, M. Tamaazousti, "The Unexpected Deterministic and Universal Behavior of Large Softmax Classifiers", Artificial Intelligence and Statistics (AISTATS’21), virtual conference, 2021. [article]

  • M. Tiomoko, H. Tiomoko, R. Couillet, "Deciphering and Optimizing Multi-Task and Transfer Learning: a Random Matrix Approach", International Conference on Learning Representations (ICLR’21), virtual conference, 2021. Spotlight article. [article]

  • Z. Liao, R. Couillet, M. Mahoney, "Sparse Quantized Spectral Clustering", International Conference on Learning Representations (ICLR’21), virtual conference, 2021. Spotlight article. [article]

  • M. Seddik, R. Couillet, M. Tamaazousti, "A Random Matrix Analysis of Learning with α-Dropout", International Conference on Machine Learning (ICML’20), Artemiss workshop, Graz, Austria, 2020. [article]

  • Z. Liao, R. Couillet, M. Mahoney, "A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent", Conference on Neural Information Processing Systems (NeurIPS’20), Vancouver, Canada, 2020. [article]

  • T. Zarrouk, R. Couillet, F. Chatelain, N. Le Bihan, "Performance-Complexity Trade-Off in Large Dimensional Statistics", International Workshop on Machine Learning for Signal Processing (MLSP’20), Espoo, Finland, 2020. [article]

  • M. Tiomoko, C. Louart, R. Couillet, "Large Dimensional Asymptotics of Multi-Task Learning", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20), Barcelona, Spain, 2020. [article]

  • L. Dall’Amico, R. Couillet, N. Tremblay, "A unified framework for spectral clustering in sparse graphs", Journal of Machine Learning Research, vol. 22, no. 187, pp. 1-56, 2022. [article]

  • L. Dall’Amico, R. Couillet, N. Tremblay, "Nishimori meets Bethe: a spectral method for node classification in sparse weighted graphs", (to appear in) Journal of Statistical Mechanics: Theory and Experiment, 2021. [article]

  • L. Dall’Amico, R. Couillet, N. Tremblay, "Community detection in sparse time-evolving graphs with a dynamical Bethe-Hessian", Conference on Neural Information Processing Systems (NeurIPS’20), Vancouver, Canada, 2020. [article]

  • L. Dall’Amico, R. Couillet, N. Tremblay, "Optimal Laplacian Regularization for Sparse Spectral Community Detection", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20), Barcelona, Spain, 2020. [article|video]

  • R. Couillet, Y. Cinar, E. Gaussier, M. Imran, "Word Representations Concentrate and This is Good News!", SIGNLL Conference on Computational Natural Language Learning (CoNLL’20), virtual conference, 2020. [article]

  • M. Seddik, R. Couillet, M. Tamaazousti, "Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures", International Conference on Machine Learning (ICML’20), Graz, Austria, 2020. [article]