Dear All,
Thanks to the support of the ESSEC IDS department, the Institut des Actuaires, the Fondation des Sciences de la Modélisation (CY - Labex MME-DII), and the group Risques AEF (SFdS), we have the pleasure of inviting you to the seminar by:
Prof. Claire Donnat
Department of Statistics, University of Chicago, USA
Date: Wednesday, 28 February 2024, at 12:30pm (Paris) and 7:30pm (Singapore)
Dual format: ESSEC Paris La Défense (CNIT), Room TBA
and via Zoom
« Sparse topic modeling via spectral decomposition and thresholding »
By modeling documents as mixtures of topics, topic modeling allows the discovery of latent thematic structures within large text corpora, and has played an important role in natural language processing over the past decades. Beyond text data, topic modeling has proven central to the analysis of microbiome data, population genetics, and, more recently, single-cell spatial transcriptomics. Given the model's extensive use, the development of estimators, particularly those capable of leveraging known structure in the data, presents a compelling challenge. In this talk, we focus more specifically on the probabilistic Latent Semantic Indexing model, which assumes that the expectation of the corpus matrix is low-rank and can be written as the product of a topic-word matrix and a word-document matrix. Although various estimators of the topic matrix have recently been proposed, their error bounds highlight a number of data regimes in which the error can grow substantially, particularly when the size of the dictionary p is large. We propose studying the estimation of the topic-word matrix under the assumption that the ordered entries of its columns rapidly decay to zero. This sparsity assumption is motivated by the empirical observation that the word frequencies in a text often adhere to Zipf's law. We introduce a new spectral procedure for estimating the topic-word matrix that thresholds words based on their corpus frequencies, and show that its ℓ1-error rate under our sparsity assumption depends on the vocabulary size p only via a logarithmic term. Our error bound is valid for all parameter regimes, and in particular for the setting where p is extremely large. Our procedure also performs well empirically relative to well-established methods when applied to a large corpus of research paper abstracts, as well as in the analysis of single-cell and microbiome data, where the same statistical model is relevant but the parameter regimes are vastly different.
Kind regards,
Jeremy Heng, Olga Klopp, Roberto Reno, and Marie Kratz
https://crear.essec.edu/crearevents/workinggrouponrisk
and Riada Djebbar (Singapore Actuarial Society - ERM)
