Internship, Laboratoire MAP5, Université Paris Cité, 45 rue des Saints-Pères, Paris.

**Host institution**: Laboratoire MAP5, Université Paris Cité
**Level of study**: Master

**Subject**:

**Context**: Extreme Value Theory (EVT) is a field of probability and statistics concerned with the tails of distributions, that is, regions of the sample space located far away from the bulk and associated with rare and extreme events. Providing probabilistic descriptions and statistical inference methods for the tails requires sound theoretical assumptions, pertaining to the theory of regular variation and maximum domains of attraction, which ensure that a limit distribution of extremes exists. This setting encompasses a wide range of applications in disciplines where extremes have a tremendous impact, such as climate science, insurance, environmental risks and industrial monitoring systems [1].
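To make the tail-focused viewpoint concrete, here is a minimal sketch of the classical Hill estimator, which infers the tail index from the largest order statistics of a heavy-tailed sample. The Pareto model, the sample size n and the number k of top order statistics are illustrative assumptions, not part of the announcement.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 200                      # full sample size, number of top order statistics
alpha_true = 3.0                        # Pareto tail index (illustrative assumption)
x = rng.pareto(alpha_true, size=n) + 1  # Pareto(alpha) sample with support [1, inf)

order = np.sort(x)                      # ascending order statistics
top = order[-(k + 1):]                  # the k+1 largest observations
# Hill estimator: mean log-excess of the k largest over the (k+1)-th largest
hill = np.mean(np.log(top[1:] / top[0]))
print(f"Hill estimate of 1/alpha: {hill:.3f}  (true value: {1 / alpha_true:.3f})")
```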
In a supervised learning framework, the goal is to learn a prediction function that performs well on new, unobserved labels. In many contexts (covariate shift, climate change), the extrapolation (or out-of-sample) properties of the predictors thus constructed are crucial, and obtaining good generalization on unobserved regions of the covariate space is key. Recently, there has been significant interest in the machine learning literature in out-of-domain generalization (see e.g. [2]).
Recent works [3,4,5] focus on the problem of learning a tail predictor based on a small fraction of the largest observations, with non-asymptotic guarantees regarding the risk on extreme regions. For simplicity, the theoretical study in these works is limited to Empirical Risk Minimization (ERM) algorithms without a penalty term. In addition, the regression problem analysed in [5] covers least squares regression only. Moreover, with heavy-tailed targets, non-linear transformations of the target are required in order to satisfy boundedness assumptions.
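A hedged sketch of this extreme-region ERM idea, in the spirit of [3,4]: retain the k observations with the largest covariate norm, project them onto the unit sphere (their angular component) and fit a standard classifier on those angles. The simulated data, the value of k and the logistic model are illustrative assumptions, not the exact setup of the cited works.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d, k = 5_000, 2, 250
X = rng.pareto(2.0, size=(n, d)) + 1        # heavy-tailed covariates
y = (X[:, 0] > X[:, 1]).astype(int)         # toy label depending on the angle only

norms = np.linalg.norm(X, axis=1)
extreme = np.argsort(norms)[-k:]            # indices of the k largest norms
angles = X[extreme] / norms[extreme, None]  # projection onto the unit sphere

# ERM on the extreme region: fit on angles of the k most extreme points only
clf = LogisticRegression().fit(angles, y[extreme])
print("training accuracy on the extreme region:", clf.score(angles, y[extreme]))
```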
**Research Objectives**: The general purpose of this internship and the subsequent thesis is to extend the supervised learning methods described above to a wider class of learning algorithms. One main limitation of least squares regression is that the optimal predictor (i.e. the conditional expectation given the covariate) is not invariant under non-linear transformations of the target. As a starting point, the least-squares framework will be extended to quantile regression which, in contrast to least squares, is compatible with non-linear transformations. From a statistical learning perspective, we shall extend the ERM framework considered thus far to encompass penalized risk minimization procedures amenable to high-dimensional covariates or non-linear regression functions. SVM quantile regression [6] is a natural candidate for this purpose. The goal will be to obtain finite-sample guarantees on the generalization error of quantile regression functions learnt from a subsample made of the largest observations, and hopefully to recover learning rates of comparable order to those obtained in the classical framework, with the full sample size n replaced by the reduced subsample size. The bottleneck is that these largest observations may not be treated as an independent sample, because they are order statistics of the full sample. However, it is anticipated that proof techniques from recent works [7,8,9] based on conditioning arguments and concentration inequalities incorporating (small) variance terms can be leveraged for this purpose.
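As a rough illustration of the proposed direction, the sketch below fits a penalized (pinball-loss) quantile regression on the subsample of the k largest observations only. scikit-learn's linear QuantileRegressor (available from scikit-learn 1.0) stands in for the kernelized SVM variant of [6]; the simulated data, the choice of k and the penalty level are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(2)
n, k, tau = 5_000, 300, 0.9
X = rng.pareto(2.5, size=(n, 1)) + 1                     # heavy-tailed covariate
y = 2.0 * X[:, 0] * (1 + 0.3 * rng.standard_normal(n))   # target scaling with X

extreme = np.argsort(X[:, 0])[-k:]                       # reduced sample of size k
model = QuantileRegressor(quantile=tau, alpha=1e-3)      # pinball loss + L1 penalty
model.fit(X[extreme], y[extreme])                        # fit on the extremes only
print(f"estimated conditional {tau}-quantile slope: {model.coef_[0]:.2f}")
```

Replacing the linear model with a kernel expansion, as in [6], would give the SVM quantile regression the project targets; the guarantees sought would then scale with the reduced subsample size k rather than the full sample size n.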
**References**
[1] Beirlant, J., Goegebeur, Y., Segers, J., and Teugels, J. L. (2004). Statistics of Extremes: Theory and Applications, volume 558. John Wiley & Sons.
[2] Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. (2022). Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415.
[3] Jalalzai, H., Clémençon, S., and Sabourin, A. (2018). On binary classification in extreme regions. In Advances in Neural Information Processing Systems (NeurIPS), volume 31.
[4] Clémençon, S., Jalalzai, H., Lhaut, S., Sabourin, A., and Segers, J. (2023). Concentration bounds for the empirical angular measure with statistical learning applications. Bernoulli, 29(4):2797–2827.
[5] Huet, N., Clémençon, S., and Sabourin, A. (2023). On Regression in Extreme Regions. arXiv preprint arXiv:2303.03084.
[6] Takeuchi, I., Le, Q. V., Sears, T. D., Smola, A. J., and Williams, C. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7.
**Supervisory Team/contact**: Anne Sabourin (MAP5, Université Paris Cité), Clément Dombry (LMB, Université de Franche-Comté)
**Start date**: Before May 2025
**Contract duration**: 4 to 6 months (+ 36 months upon pursuing a PhD)
**Sector**: Academic research
**Description**: The internship is intended to lead to a PhD thesis if everything goes as planned. The PhD will be funded by the ANR project EXSTA, led by A. Sabourin. The PhD candidate will benefit from interactions with other researchers in the field, e.g. through workshops organised within the project’s framework, in addition to the usual participation in conferences. A collaboration is envisioned with Johan Segers (Department of Mathematics, KU Leuven) on the research questions of the PhD thesis.
**Learn more**: https://helios2.mi.parisdescartes.fr/~asabouri/index.html#generalInfo (offreStage2024.pdf)
**Contact**: anne.sabourin@u-paris.fr