Name: Random Forests: Why They Work and Why That's a Problem
Start: 2022-10-20T19:30:00
End: 2022-10-20T20:30:00

Submitted by rpc5102 on Thu, 10/13/2022 - 08:06

Random Forests: Why They Work and Why That's a Problem

Presented By

Lucas Mentch (University of Pittsburgh)

Details

Start Date: Thu, Oct 20, 2022, 3:30 PM

End Date: Thu, Oct 20, 2022, 4:30 PM

Location

201 Thomas Building, University Park, PA

Event Series: Statistics Colloquia

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite this empirical success, a full and satisfying explanation for why they work has yet to be put forth. In this talk, we will show that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. From a model-complexity perspective, this means that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicit regularization procedures like the lasso. Realizing this, we demonstrate that alternative forms of randomness can provide similarly beneficial stabilization. In particular, we show that augmenting the feature space with additional features consisting of only random noise can substantially improve the predictive accuracy of the model. This surprising fact has been largely overlooked within the statistics community, but has crucial implications for thinking about how best to define and measure variable importance. Numerous demonstrations on both real and synthetic data are provided.
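For readers who want to try the idea in the abstract before the talk, the following is a minimal sketch, not the speaker's code or experimental setup. It assumes scikit-learn's `RandomForestRegressor`, whose `max_features` argument plays the role of mtry, and a synthetic linear model with a low SNR. The helper names (`make_low_snr_data`, `add_noise_features`, `heldout_mse`) and all settings (sample size, SNR of 0.1, 20 noise columns) are illustrative assumptions. It compares bagging (mtry equal to the number of features), a forest with a smaller mtry, and bagging on a feature matrix augmented with pure-noise columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def make_low_snr_data(n=300, p=10, snr=0.1, seed=0):
    """Linear signal plus Gaussian noise, scaled so Var(signal) / Var(noise) is about snr."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = 1.0  # assumption: only the first three features carry signal
    signal = X @ beta
    noise_sd = np.sqrt(signal.var() / snr)
    y = signal + rng.normal(scale=noise_sd, size=n)
    return X, y


def add_noise_features(X, n_noise, seed=0):
    """Append n_noise columns of Gaussian noise that are independent of the response."""
    rng = np.random.default_rng(seed)
    return np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])


def heldout_mse(X, y, max_features, seed=0):
    """Held-out mean squared error of a forest with the given mtry (max_features)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    forest = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=seed
    )
    forest.fit(X_tr, y_tr)
    return float(np.mean((forest.predict(X_te) - y_te) ** 2))


X, y = make_low_snr_data()
p = X.shape[1]
results = {
    "bagging (mtry = p)": heldout_mse(X, y, max_features=p),
    "random forest (mtry = p // 3)": heldout_mse(X, y, max_features=p // 3),
    "bagging + 20 noise features": heldout_mse(
        add_noise_features(X, 20), y, max_features=p + 20
    ),
}
for name, mse in results.items():
    print(f"{name}: test MSE = {mse:.3f}")
```

A single train/test split on one synthetic data set is noisy, so the printed errors should be read as a starting point for experimentation (for example, averaging over many seeds and varying `snr`) rather than as evidence for or against the claims in the abstract.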