Feature Selection from High Dimensions

Abstract: Feature selection is a known challenge and, in its current state, is more of an art than a science, because the approach can differ depending on the problem the data scientist is trying to solve. While there are methods such as regularization, recursive feature selection, or automated processes like Boruta, these methods all perform differently depending on the type of algorithm used in the training process. With the rise of larger ensembles built on a wide range of underlying models, relying on a single feature selection process can lead to an underperforming model. This problem is especially prevalent with the rise of automated machine learning platforms that use a variety of base models.

Automated processes like Boruta showed early promise because they provided superior performance with Random Forests, but they have some deficiencies, including slow computation time, especially with high-dimensional data. Regardless of run time, Boruta performs well with Random Forests but poorly with other algorithms such as boosting or neural networks. Similar deficiencies occur with regularization in LASSO, elastic net, or ridge regressions: the selected features perform well in linear regressions but poorly in other modern algorithms.

I propose and demonstrate a feature selection algorithm in a similar spirit to Boruta that uses XGBoost as the base model. The algorithm runs in a fraction of the time Boruta takes and has superior performance on a variety of datasets, including one with nearly twenty-two thousand features. These results hold up across a number of UCI Repository datasets. Evaluation results and timings will be shared along with the underlying code, which will later be converted into a library and posted on CRAN.
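
For readers unfamiliar with the Boruta idea, the sketch below illustrates the general shadow-feature technique it is built on: each real feature is shuffled to create a "shadow" copy with no relationship to the target, and a real feature is kept only if its importance beats the best shadow in most rounds. This is not the algorithm presented in the talk. The function names (shadow_select, correlation_importance), the parameters (n_rounds, min_hit_rate), and the toy data are all hypothetical, and an absolute-correlation score stands in for model-based importances (such as those from a Random Forest or a boosted model) so that the example runs with the Python standard library alone.

```python
import random
from statistics import mean, pstdev


def abs_correlation(values, target):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    sx, sy = pstdev(values), pstdev(target)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(values), mean(target)
    cov = mean([(x - mx) * (y - my) for x, y in zip(values, target)])
    return abs(cov / (sx * sy))


def correlation_importance(columns, target):
    """Stand-in scorer (an assumption, not a tree model): name -> score."""
    return {name: abs_correlation(v, target) for name, v in columns.items()}


def shadow_select(columns, target, importance=correlation_importance,
                  n_rounds=20, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shadow feature in most rounds.

    columns: dict mapping feature name -> list of numeric values.
    importance: callable(columns, target) -> dict of name -> score.
    """
    rng = random.Random(seed)
    hits = dict.fromkeys(columns, 0)
    for _ in range(n_rounds):
        # Add one shuffled "shadow" copy of every real column. Shuffling
        # keeps the column's distribution but breaks its link to the target.
        augmented = dict(columns)
        for name, values in columns.items():
            shuffled = list(values)
            rng.shuffle(shuffled)
            augmented["shadow_" + name] = shuffled
        # Score real and shadow columns together in one call.
        scores = importance(augmented, target)
        threshold = max(scores["shadow_" + name] for name in columns)
        for name in columns:
            if scores[name] > threshold:
                hits[name] += 1
    return [n for n in columns if hits[n] / n_rounds >= min_hit_rate]


# Toy data (hypothetical): one informative column, one noise column,
# and one constant column.
data_rng = random.Random(42)
n_rows = 200
target = [data_rng.gauss(0, 1) for _ in range(n_rows)]
columns = {
    "signal": [y + 0.5 * data_rng.gauss(0, 1) for y in target],
    "noise": [data_rng.gauss(0, 1) for _ in range(n_rows)],
    "constant": [1.0] * n_rows,
}
selected = shadow_select(columns, target)
print(selected)
```

The importance argument is the extension point: passing a function that fits a model on the augmented columns and returns its per-feature importances would move the sketch closer to how Boruta-style methods are normally run. On the toy data the informative column is kept and the constant column is dropped; a pure noise column is usually dropped as well, though that depends on the random draw.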

Bio: Chase is currently a Data Scientist at Progressive Leasing in Draper, Utah, working on a variety of cool projects. Prior to his current position, he was an Assistant Professor of Finance and Economics at the University of South Carolina Upstate. He holds a BS, MS, and PhD, all in Economics, from the University of Utah.

Open Data Science Conference