[SFdS] Information from the Risques AEF group
WG Risk - 28 February 2024 - Prof. Claire Donnat

Dear All,

With the support of the ESSEC IDS department, the Institut des Actuaires, the Fondation des Sciences de la Modélisation (CY - Labex MME-DII), and the Risques AEF group (SFdS), we are pleased to invite you to the seminar by:

Prof. Claire Donnat
Department of Statistics, University of Chicago, USA

Date: Wednesday, 28 February 2024, at 12:30pm (Paris) and 7:30pm (Singapore)

Dual format: ESSEC Paris La Défense (CNIT), Room TBA
and via Zoom

« Sparse topic modeling via spectral decomposition and thresholding »

By modeling documents as mixtures of topics, topic modeling allows the discovery of latent thematic structures within large text corpora, and has played an important role in natural language processing over the past decades. Beyond text data, topic modeling has proven central to the analysis of microbiome data, population genetics, and, more recently, single-cell spatial transcriptomics. Given the model's extensive use, the development of estimators, particularly those capable of leveraging known structure in the data, presents a compelling challenge. In this talk, we focus more specifically on the probabilistic Latent Semantic Indexing model, which assumes that the expectation of the corpus matrix is low-rank and can be written as the product of a topic-word matrix and a topic-document matrix. Although various estimators of the topic-word matrix have recently been proposed, their error bounds highlight a number of data regimes in which the error can grow substantially, particularly when the size of the dictionary p is large. We propose studying the estimation of the topic-word matrix under the assumption that the ordered entries of its columns rapidly decay to zero. This sparsity assumption is motivated by the empirical observation that the word frequencies in a text often adhere to Zipf's law. We introduce a new spectral procedure for estimating the topic-word matrix that thresholds words based on their corpus frequencies, and show that its ℓ1-error rate under our sparsity assumption depends on the vocabulary size p only via a logarithmic term. Our error bound is valid for all parameter regimes, and in particular for the setting where p is extremely large. Our procedure also performs well empirically relative to well-established methods when applied to a large corpus of research paper abstracts, as well as to single-cell and microbiome data, where the same statistical model is relevant but the parameter regimes are vastly different.

Kind regards,
Jeremy Heng, Olga Klopp, Roberto Reno, Marie Kratz,
and Riada Djebbar (Singapore Actuarial Society - ERM)

SFdS - Société Française de Statistique
©2024 SFdS