Course 1 - Introduction
This course will lay the foundations of tree-based methods in machine learning. Starting from the general principle of decision trees, their advantages and limitations, and their application to real data, we will then move on to popular tree ensemble methods such as bagging, random forests, and boosting. The general philosophy will be to give an in-depth presentation of the key mechanisms at stake: the splitting procedure of the tree nodes and the way the individual predictions are aggregated. We will also emphasize the tuning parameters and the interpretation tools of these methods: graphical outputs, prediction error estimates, etc. For each method, we will give examples of its application using R.
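As a taste of the node-splitting mechanism mentioned above, here is a minimal illustrative sketch (in Python with NumPy rather than the R used in class; the helper names `gini` and `best_split` are our own, not a library API) of an exhaustive search for the split minimizing the weighted Gini impurity of the two children:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search over (feature, threshold) pairs, as a CART-style
    tree does at each node; returns the split with the lowest weighted
    child impurity."""
    n, d = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, impurity score)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data: feature 0 separates the two classes perfectly, feature 1 is noise.
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]])
y = np.array([0, 0, 1, 1])
j, t, score = best_split(X, y)
print(j, score)  # the search picks feature 0, with zero remaining impurity
```

A full tree simply applies this search recursively to each child node, and an ensemble then aggregates the per-tree predictions (majority vote for classification, averaging for regression).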
Course 2 - Variable importance measures
One of the main benefits of random forests, and of tree-based ensemble methods in general, is their ability to select the most relevant features, a direct consequence of the greedy node-splitting strategy. Selected features can furthermore be ranked, and their predictive influence quantified, using so-called variable importance measures. In this lecture, we will review the most prominent importance measures from the literature, including the mean decrease in impurity, the mean decrease in accuracy, and TreeSHAP. We will discuss both global and local versions of these measures, highlight the main specificities of each, establish links between them, and review the main theoretical works in this field. Practicals in Python and R will allow participants to experiment with these measures on several illustrative problems.
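To make one of these measures concrete, here is a hand-rolled sketch of the mean decrease in accuracy (permutation importance): permute one feature at a time and record the drop in accuracy. The `model` here is a deliberately trivial stand-in for a fitted forest, and `permutation_importance` is our own toy helper, not the scikit-learn function of the same name:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only feature 0 carries the signal; feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def model(X):
    """Stand-in for a fitted forest: predicts from feature 0 alone."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance(model, X, y, n_repeats=20):
    """Mean decrease in accuracy: permuting a feature breaks its link with
    the target; the resulting accuracy drop measures its importance."""
    base = np.mean(model(X) == y)  # baseline accuracy
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops.append(base - np.mean(model(Xp) == y))
        imp[j] = np.mean(drops)
    return imp

imp = permutation_importance(model, X, y)
print(imp)  # large drop for feature 0, exactly zero for the ignored feature 1
```

In the practicals, the same computation is typically delegated to library routines applied to an actual fitted forest, but the mechanism is exactly the one above.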
Course 3 - RF in the supervised learning landscape
There is now a plethora of supervised learning methods available to the practitioner. Despite the clear dominance of deep learning in some application domains (e.g., computer vision or NLP), RF remains a very competitive method for tabular datasets that combines several desirable features: non-parametricity, ease of use, good predictive performance, and computational efficiency. In this lecture, we will discuss the positioning of RF within the supervised learning landscape. We will first highlight the advantages and disadvantages of RF that have emerged from several empirical studies in the machine learning literature. We will then review several works that have established theoretical links between RF and other methods (e.g., kNN or kernel methods) or that have combined RF with other method families (e.g., linear models or deep learning) to improve performance.
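One of those theoretical links, the view of RF as an adaptive nearest-neighbor or kernel method, can be glimpsed numerically: the fraction of trees in which a query point shares a leaf with each training point acts as a data-driven kernel. The sketch below (assuming scikit-learn; the proximity computed here is a simplified version that ignores leaf sizes and bootstrap weights, so it only approximates the forest prediction):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x0 = np.array([[0.5]])
leaves_train = forest.apply(X)   # leaf index of each training point, per tree
leaves_x0 = forest.apply(x0)     # leaf indices of the query point, per tree

# Simplified RF "kernel": fraction of trees where a training point
# falls in the same leaf as x0.
K = (leaves_train == leaves_x0).mean(axis=1)
w = K / K.sum()  # normalized weights: RF as a weighted nearest-neighbor rule
pred_kernel = np.dot(w, y)
print(pred_kernel, forest.predict(x0)[0])  # the two predictions are close
```

The weights `w` are nonzero only for training points that are "neighbors" of `x0` in the tree partitions, which is the intuition formalized in the works reviewed in this lecture.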
Course 4 - Extensions of RF and Big Data context
Random forests are versatile and used in a variety of contexts. In this course, we will focus on several extensions, uses, and adaptations of random forests. In the first part of the course, we will describe several strategies for adapting random forests to the big data framework (when the number of observations is very large), and we will discuss the advantages and drawbacks of these strategies and whether or not they are equivalent to the original version of the method. In the second part of the course, other extensions will be presented, with a specific focus on those suited to time series analysis. This part of the class will be illustrated with applications in R and/or Python, depending on the teacher's mood.
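One common big-data strategy of the kind discussed here is divide and conquer: fit independent sub-forests on disjoint chunks of the data (possibly on different machines) and aggregate their predictions. A minimal single-machine sketch, assuming scikit-learn and a synthetic regression problem of our own making:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 3000
X = rng.uniform(-1, 1, size=(n, 2))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=n)

# Divide: fit an independent sub-forest on each disjoint chunk of the data.
chunks = np.array_split(np.arange(n), 3)
subforests = [
    RandomForestRegressor(n_estimators=50, random_state=k).fit(X[idx], y[idx])
    for k, idx in enumerate(chunks)
]

# Conquer: aggregate by averaging the sub-forest predictions.
X_test = rng.uniform(-1, 1, size=(100, 2))
pred = np.mean([f.predict(X_test) for f in subforests], axis=0)
mse = np.mean((pred - (X_test[:, 0] ** 2 + X_test[:, 1])) ** 2)
print(mse)  # low test error despite the distributed fitting
```

Whether such a chunked forest behaves like a forest fitted on the full sample is precisely one of the equivalence questions examined in the first part of the course.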
Course 5 - Random survival forests, variance estimation and personalized medicine
This lecture focuses on three topics. First, we will introduce some recent theoretical understanding of random forests, including asymptotic normality and variance estimation. In particular, we will discuss the infinitesimal jackknife and U-statistics views of variance estimation and their implementations. Second, we will introduce random survival forests, a popular tool for analyzing censored survival data. A confidence band estimation method will be presented to quantify the variation of the estimated survival function. Third, we will demonstrate applications of random forests to personalized medicine, an emerging area in biomedical research, with the goal of identifying heterogeneous treatment strategies tailored to individual patients. Examples of the aforementioned methods will be given in popular computing environments such as R and Python.
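To fix ideas on what "variance estimation for a forest prediction" means, here is a deliberately naive baseline, not the infinitesimal jackknife or U-statistics estimators covered in the lecture: refit the forest on bootstrap resamples of the data and take the empirical variance of the predictions at a query point (sketch assuming scikit-learn and synthetic data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.2 * rng.normal(size=n)
x0 = np.array([[0.0]])  # query point where we want a variance estimate

# Naive bootstrap-of-the-data variance estimate of the RF prediction at x0.
preds = []
for b in range(30):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of the sample
    f = RandomForestRegressor(n_estimators=100, random_state=b)
    f.fit(X[idx], y[idx])
    preds.append(f.predict(x0)[0])
var_hat = np.var(preds)
print(var_hat)  # sampling variability of the prediction at x0
```

The estimators presented in the lecture achieve the same goal far more cheaply, by reusing the in-bag structure of a single fitted forest instead of refitting it thirty times.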