Viroscopy - 2009/12

Tamis - 2007/10

Bemol - 2009/11

The TAMIS project involves members of the Statistics group of the Laboratoire de Probabilités et Modèles Aléatoires (UMR 7599 CNRS, Université Paris 6 et 7), of the Bioinformatics Department of the Curie Foundation, of the Met@risk group at INRA, and of the R&D team of the Pertinence company. These people have already enjoyed working together and share a common interest in Mathematical and Applied Statistics. The TAMIS project revolves around three questions from Statistics: the multiple testing problem, the ranking problem, and adaptivity to sparsity in estimation. During the last decade, these three topics have been intensively investigated in Statistics and in Statistical Learning Theory. Adaptivity to sparsity is a pivotal theme in wavelet estimation, and the Gaussian sequence model provides a clean theoretical testbed for this problem. In the framework of constructive approximation theory, attempts to capture sparsity through the definition of relevant functional spaces have been quite successful. But nowadays, adaptivity to sparsity is also considered as a multiple testing problem; this domain is a playground for Empirical Bayes methods, thresholding, frequentist False Discovery Rate (FDR) control procedures, and more. The TAMIS project aims at further exploring those adaptivity questions and relating them to practical issues stemming from Genomics, food safety risk control, and manufacturing process optimization. While feature/variable selection methods for process optimization or food safety risk control clearly belong to the same family as adaptivity to sparsity (they all suffer from the curse of dimensionality), the connection to multiple testing has been perceived more recently (even though some model selection methods, such as pre-testing, are by definition methods that combine the results of multiple tests). The multiple testing problem (how should p-values be used when many tests are performed simultaneously, while both the false-alarm rate and the detection power have to remain reasonable?)
has become prominent because it emerges naturally in data mining (post-hoc analysis) and DNA-microarray analysis. We aim at further investigating FDR control procedures, studying their relation to adaptivity to sparsity, and using them on an important issue in Oncology: the analysis of the correlation between structural genome alterations and differential expression for certain classes of tumors. Tackling those questions is only possible through collaboration between statisticians and computer scientists with direct access to clinical and biological data. Facing a multiple testing problem, one may also look at ranking: how should the data be sorted so as to comply with the order induced by the (usually unknown) probability of not satisfying the null hypothesis? This question generalizes the core questions of Statistical Learning Theory, and it is relevant both to the manufacturing process optimization problem and to the food safety risk control problem. The TAMIS project members intend to continue working on ranking and to test their algorithms on the data made accessible through the project; they also want to tackle the ROC curve estimation problem. On all three themes, the TAMIS project members work as statisticians: they aim at proving the existence of sound procedures (with guaranteed convergence rates and, hopefully, matching lower bounds). But they also behave as computer scientists (some of the project members are or were trained as computer scientists): they are interested in computationally feasible procedures. The project members are also committed to confronting theory with data.
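The family of FDR control procedures mentioned above can be illustrated with the classical Benjamini-Hochberg step-up procedure. The sketch below is a minimal illustration of that standard procedure, not one of the project's own methods, and the p-values fed to it are purely hypothetical:

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking the rejected null hypotheses,
    controlling the false discovery rate at level alpha
    (under independence of the p-values).
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, smallest first
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds              # step-up comparison p_(k) <= alpha*k/m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest k whose p-value is below threshold
        rejected[order[:k + 1]] = True          # reject the k+1 smallest p-values
    return rejected

# Hypothetical p-values: a few strong signals among noise.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Note the step-up character of the procedure: a p-value may be rejected even if it exceeds its own threshold, as long as some larger-indexed p-value falls below its threshold.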


The Viroscopy project essentially aims at developing stochastic mathematical models for the spread of transmissible infectious diseases, together with dedicated statistical methodologies, with the intent to deliver efficient diagnostic/prediction tools for epidemiologists. In our era of massive automatic data collection, many measurements related to the spread of infections are now systematically obtained and gathered in databases, without necessarily knowing which ones will be relevant for describing, understanding, or predicting the epidemic of interest. The enormous progress made over the last ten years in gathering such data encourages applied mathematicians and epidemiologists to develop new models incorporating more features, in order to account for real-life situations. Although many variants of the standard SIR model have been proposed in the Biostatistics literature during the last two decades (far too numerous to be listed here), analyzing epidemic data in its whole complexity remains a very challenging task. Recent advances in Probability and Statistics suggest that new analytical and computational tools, based for instance on computer-intensive simulation methods interpretable in terms of interacting particle systems, may be applied to epidemic data in order to produce useful estimates for epidemiologists and the public-health community. Since the design of mathematical methodologies for analyzing epidemic data and tackling important questions related to epidemiological modelling is at the center of the present research project, the latter is clearly a matter of interdisciplinary fundamental research. Beyond the expected wide-reaching results (guaranteeing a validity framework for the models considered and the convergence of the numerical procedures elaborated), the Viroscopy project members are also committed to confronting theory with data.
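As a deliberately minimal illustration of the stochastic epidemic models discussed above, the standard SIR model can be simulated exactly with the classical Gillespie algorithm; the parameter values below are hypothetical and chosen for illustration only:

```python
import random

def sir_gillespie(n, i0, beta, gamma, seed=0):
    """Exact stochastic simulation (Gillespie algorithm) of a standard
    SIR epidemic in a closed population of size n.

    Infection rate per S-I pair: beta / n; recovery rate per infective: gamma.
    Returns the trajectory as a list of (time, S, I, R) tuples.
    """
    rng = random.Random(seed)
    s, i, r, t = n - i0, i0, 0, 0.0
    traj = [(t, s, i, r)]
    while i > 0:
        rate_inf = beta * s * i / n      # total infection rate
        rate_rec = gamma * i             # total recovery rate
        total = rate_inf + rate_rec
        t += rng.expovariate(total)      # exponential waiting time to next event
        if rng.random() < rate_inf / total:
            s, i = s - 1, i + 1          # infection event: S -> I
        else:
            i, r = i - 1, r + 1          # recovery event: I -> R
        traj.append((t, s, i, r))
    return traj

traj = sir_gillespie(n=1000, i0=5, beta=1.5, gamma=0.5)
t_end, s_end, i_end, r_end = traj[-1]
print(f"epidemic over at t={t_end:.1f}, final size={r_end}")
```

With beta/gamma > 1 as here, a large outbreak is likely; the simulation stops when the infective compartment empties.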

In the present context of ubiquitous data, there is a growing need for automatic filtering of relevant information for every individual internet user. Adapting the interface to the individual preferences of a potential buyer is of major concern for commercial websites, but, more broadly, the idea of automatically customized portals adjusting to the cognitive profile of a user can be seen as an important step towards the semantic web. The main keyword referring to this problem is recommendation (computer programs performing this task are known as recommender systems). Depending on the context, various constraints and sources of data can be taken into consideration. For instance, Google's AdSense parses webpages in order to select relevant keywords in the current webpage and customize sponsored links accordingly, while a commercial website with an authentication process may use individual purchase data to provide recommendations. The present evolution of recommender systems follows the collaborative filtering approach: the profile of the active web user is compared to those of other users with similar preferences in order to come up with new recommendations. Today, innovation and research are centered on the construction and analysis of this measure of similarity, integrating diverse sources of data (browsing data, consumer profile data, declarative data, individual questionnaires, marketing scores, etc.). In this context, very recent research in statistical learning theory suggests that important breakthroughs could be achieved with recommender systems when confronted with heterogeneous data and recommendations with multiple criteria. The purpose of the project is to develop conceptual and algorithmic tools for the automatic inference of users' behavior.
The approach will rely on: (1) expert advice on collaborative filtering for e-marketing applications, (2) statistical modelling and forecasting in order to exploit the massive and heterogeneous databases available for this project. Working hypotheses and predictive models based on learning theory will be confronted with real and simulated datasets. The core of the project is to be considered as fundamental research, since we aim at extracting generic mechanisms for inferring macroscopic behavior from the low-level information encoded in raw data. Besides, we underline the fact that in this research, ideas need to be validated on real data, and the development of software applications plays a key role both for research and for marketing applications.
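The collaborative filtering principle described above (comparing the active user's profile to those of users with similar preferences) can be sketched with a basic user-based, cosine-similarity recommender. This is an illustrative toy, not the project's method, and the ratings matrix is invented:

```python
import numpy as np

def recommend(ratings, user, k=2, n_items=2):
    """User-based collaborative filtering sketch.

    ratings: users x items matrix, 0 meaning 'not rated'.
    Scores the unrated items of `user` by a similarity-weighted
    average of the ratings of the k most similar users.
    """
    R = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + 1e-12)  # cosine similarity to each user
    sims[user] = -1.0                                   # exclude the active user itself
    neighbours = np.argsort(sims)[::-1][:k]             # k most similar users
    w = sims[neighbours]
    scores = w @ R[neighbours] / (w.sum() + 1e-12)      # similarity-weighted mean rating
    scores[R[user] > 0] = -np.inf                       # recommend unrated items only
    return np.argsort(scores)[::-1][:n_items]

# Toy ratings matrix (rows: users, columns: items), purely illustrative:
# users 0-1 like the first two items, users 2-3 like the last three.
R = [[5, 4, 0, 1, 0],
     [4, 5, 1, 0, 0],
     [1, 0, 5, 4, 5],
     [0, 1, 4, 5, 4]]
print(recommend(R, user=0))
```

Real systems replace the raw ratings by the heterogeneous data sources listed above (browsing data, declarative data, marketing scores) inside the similarity measure, which is exactly where the project locates the research effort.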

Tangerine - 2010/13

Data is often nonnegative by nature: consider for example pixel intensities, amplitude spectra, occurrence counts, food consumption, user scores or stock market values. Optimal processing of such data may therefore call for processing under nonnegativity constraints. Nonnegative matrix factorization (NMF) is a linear regression technique whose popularity is surging in the fields of machine learning and signal/image processing. Given a data matrix V of dimensions F-by-N with nonnegative entries, NMF is the problem of finding a factorization V ≈ WH where W and H are nonnegative matrices of dimensions F-by-K and K-by-N, respectively. K is usually chosen such that FK + KN << FN, hence reducing the data dimension. The factorization is in general only approximate, so that the terms "approximate nonnegative matrix factorization" or "nonnegative matrix approximation" also appear in the literature. Like Vector Quantization (VQ), Principal Component Analysis (PCA) or Independent Component Analysis (ICA), NMF provides an unsupervised linear representation of data, in the sense that a data point vn (the nth column of V) is approximated as a linear combination of salient features: vn ≈ W hn. The main novelty of NMF with respect to VQ, PCA or ICA is that it keeps W and hn nonnegative, hence improving the interpretability of the learnt dictionary and of the activation coefficients. For example, if V describes the food consumption of a population of N individuals, W will express typical dietary behaviours, so there is no reason for W to have negative values.

In their landmark paper published in Nature, Lee and Seung (1999) illustrate the concept of NMF with the decomposition of MIT's Center for Biological and Computational Learning (CBCL) face dataset. The reported results show how the non-subtractive constraint imposed by NMF produces a part-based representation of the data: the dictionary learnt by NMF is composed of parts of faces (noses, mouths, pairs of eyes, pairs of cheeks, pairs of eyebrows, pairs of ears, etc.). In contrast, the elements of the PCA dictionary (the "eigenfaces") contain negative values and are not easily interpretable per se. The growing interest in NMF is reflected by the growing number of publications on the subject since 1999. We ran a search in the ISI Web of Science database over the time range 1999-2007 and were returned 518 publications. Figure 2 displays statistics about the number of papers published per year and their geographic origin. It reveals a clear upward trend in the popularity of NMF, and also the lack of a significant work effort on the subject in France: the database reports only 13 publications (co-)signed by a French institution, most of them mere applications of existing algorithms to specific problems rather than strong methodological contributions. The TANGERINE project intends to fill this gap.
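As a minimal illustration of NMF as defined above, the following sketch implements the multiplicative updates popularized by Lee and Seung for the Euclidean cost ||V - WH||^2. The synthetic data, the number of iterations and the initialization scheme are illustrative choices, not a prescribed method:

```python
import numpy as np

def nmf(V, K, n_iter=200, seed=0):
    """NMF via the multiplicative updates of Lee and Seung,
    minimizing the Euclidean distance ||V - WH||^2.

    V: F x N nonnegative data matrix.
    Returns nonnegative factors W (F x K) and H (K x N).
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 1e-3     # positive random initialization
    H = rng.random((K, N)) + 1e-3
    eps = 1e-12                        # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update dictionary
    return W, H

# Synthetic exactly-low-rank nonnegative data, for illustration only.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(V, K=3)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.3f}")
```

Because the updates are multiplicative and the factors start positive, W and H stay nonnegative throughout, which is precisely the constraint that yields the part-based interpretability discussed above.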

ANR Project Blank

N° LAN06-2_134570

Name: Adaptation, multiple testing, ranking and applications

Acronym: TAMIS

Coordinator: Stéphane Boucheron

Length: 36 months

Institutes: UPMC Paris 6 - Denis Diderot Paris 7 - INRA - S.A. Pertinence - Institut Curie



Name: Stochastic Modelling and Statistical Inference for the Spread of Infectious Communicable Diseases: from the Microscopic View to Macroscopic Approximation

Acronym: Viroscopy

Coordinator: Stéphan Clémençon

Length: 36 months

Institutes: Telecom ParisTech - Université René Descartes - INRIA Sud-Ouest - Université Lille 1 - Universidad La Habana


Name: Forecasting of internet users' behavior – data aggregation, statistical learning models, simulation and collaborative filtering

Acronym: Bemol

Coordinator: N. Vayatis

Length: 24 months

Institutes: ENS Cachan - Telecom ParisTech - Sté Mille Mercis

ANR Project «Young Researchers»

Name: Theory and Applications of Nonnegative Matrix Factorization

Acronym: Tangerine

Coordinator: C. Févotte

Length: 36 months

Institutes: Telecom ParisTech - INRA


ACI - Nouvelles Interfaces des Mathématiques

N° 0437

Name: Epidemiology of HIV infection in Cuba: stochastic modelling and forecasting

Coordinator: Stéphan Clémençon

Length: 36 months

Institutes: Université Paris X - INSERM - Université Paris V

Epidemiological Modelling - 2004/07



Name: Ranking and automatic model selection: theory and algorithms

Acronym: Crank-Up

Coordinator: N. Usunier

Length: 18 months

Institutes: Université Pierre et Marie Curie - Telecom ParisTech - GIS PARISTIC

MetaboMine - 2011

Project INRA

Name: Automatic recognition of phenotypes based on metabolomic observations

Acronym: MetaboMine

Coordinators: S. Clémençon, A. Paris

Length: 3 months

Institutes: INRA - Telecom ParisTech

The aim of this ACI project is to develop a mathematical model that both accounts for and anticipates the recent evolution of the AIDS epidemic in Cuba. The statistical modelling of the epidemic will rely on a major asset: access to the database built by the Sanatorium of Santiago de Las Vegas (Cuba) for the epidemiological monitoring of the virus. Unique of its kind, this database contains not only the medical, socio-demographic and behavioural information on the HIV-infected individuals detected by the Cuban public-health system, but also a list of the sexual partners (provided on a voluntary basis) of each of these individuals: an essential feature of the AIDS control system set up in Cuba is the active tracing of the sexual contacts of infected persons. Building on this exceptional source of information, this ACI will aim at constructing a structured-population model for the evolution of the epidemic, based on the description of individual behaviours by Markovian stochastic processes. The diversity of the population can be taken into account by characterizing each individual through variables identified as epidemiologically relevant among the attributes recorded in the database. The main difficulty of the project lies in the prior determination of these state variables and in the modelling of their effects on the evolution of individuals and on their interactions; such a model must naturally allow the design of consistent statistical estimation strategies as well as the implementation of numerical simulation methods.
The problem of statistical inference (estimation and construction of confidence intervals) for such models is a new field of investigation and represents a genuine mathematical challenge. Beyond the specific questions of public health in Cuba and of the evaluation of the particular AIDS-control policy conducted there, this project will, more generally, shed light on the mechanisms of transmission and evolution of the disease.

Machine Learning & Networks - 2008/09

Project Futur & Rupture - Institut Telecom

Name: Machine Learning & Networks

Coordinator: S. Clémençon

Length: 10 months

Institutes: Telecom ParisTech - INRA - UC San Diego


The "Machine Learning & Networks" project aims at developing and implementing machine-learning methods for statistically describing the mechanisms of information propagation within a social network, represented by a random graph whose vertices are "labelled" so as to take into account the characteristics of the individuals forming the network. In particular, we will focus on developing statistical procedures (homogeneity tests) to highlight the possible impact of the initial conditions (the way the information is initially disseminated within the network: where and how) on the speed and nature of the propagation of information through the network. While the problem under consideration spans many thematic fields, the application considered here concerns the communication of information related to food-safety risks. It will rely on the database built by INRA, a partner of the project.


The project aims at developing tools for the automatic recognition of a phenotype from metabolomic observations, using concepts recently developed in machine learning (e.g. ROC curve optimization, multi-class ranking, functional data analysis).
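As a minimal illustration of the ROC curve viewpoint mentioned above, the empirical ROC curve and AUC of a scoring function can be computed as follows; the scores and binary labels are hypothetical:

```python
import numpy as np

def roc_curve(scores, labels):
    """Empirical ROC curve and AUC of a scoring function.

    The decision threshold is swept over the observed scores; the function
    returns the false-positive rates, the true-positive rates, and the
    area under the curve (the empirical AUC, i.e. the probability that a
    positive instance is scored above a negative one, absent ties).
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    order = np.argsort(-s)                                   # descending scores
    tpr = np.concatenate(([0.0], np.cumsum(y[order]) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(~y[order]) / (~y).sum()))
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)    # trapezoidal area
    return fpr, tpr, auc

# Hypothetical scores and binary labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr, auc = roc_curve(scores, labels)
print(f"empirical AUC: {auc:.4f}")
```

Bipartite ranking algorithms of the kind alluded to here seek scoring functions whose empirical ROC curve dominates, i.e. whose AUC (or a refinement of it) is maximal.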

While a wide range of algorithms is now available for the central Pattern Recognition problem, binary classification, together with a satisfactory theoretical framework, ranking problems raise new scientific and technological challenges in the field of machine learning. In many applications (search engines, recommendation engines, etc.), the goal is no longer to learn to assign a label to observations ('relevant document' vs 'irrelevant document', for example) but to rank/order the observations with respect to one another: in the context of search engines, for instance, one aims at ranking documents according to their degree of relevance for a given query. Significant progress has recently been made in this area; in particular, the members of the present team, gathered within the GIS PARISTIC ('Apprentissage' group), have contributed extensively in recent years to the design of numerous ranking algorithms and to their application to the automatic retrieval of content (digital documents, multimedia, etc.). Putting these algorithms into practice, however, requires the specification of crucial parameters controlling the complexity of the ranking rules they produce. The model selection problem aims precisely at automating this choice, so as to avoid over- or under-fitting the training data. In the field of regression/classification, similar questions have recently seen very significant advances based on bootstrap techniques, challenging the 'structural' penalization methods that became established in the 1990s with the work of V. Vapnik; the calibration of the penalties is then guided by a new heuristic, known as the slope heuristic. The goal of this project is to remove the remaining scientific obstacles so as to extend this progress to the 'global learning' problem that ranking represents.

Crank-Up - 2011/12


Erasm - 2012/14


Acronym: ERASM

Coordinator: J.L. Liévin

Length: 24 months

Institutes: SA IdexLab - Mendeley - Telecom ParisTech



Stéphan Clémençon
