Feature Selection for Regression

Antwerp, Belgium - February 16 and 17, 2012

Amsterdam, Netherlands - March 22, 2012

Time: 10.00-16.30


This one-day training is on feature selection and feature importance in data-driven modeling for hard regression problems.

Variable selection is the process of identifying the influential variables (features, attributes) of a real or simulated system: those that are discriminative and necessary to describe the system's performance characteristics.

Focusing research and modeling on the relevant variables reduces the dimensionality of the original problem (making it tractable), shortens time to market (by facilitating insight), improves generalization (by producing robust knowledge), and substantially cuts the cost of developing and deploying data-driven solutions.

In the current era of "Data, Data Everywhere," computational (or data-driven) modeling and data mining have become crucial skills for every scientist, researcher, and analyst. Effective data-driven modeling is impossible without proper variable selection, which is why variable selection is such a 'hot topic' these days.

Many books have been, and are being, written on variable selection in machine learning, data mining, and artificial intelligence, but the vast majority focus on identifying relevant variables in classification problems with discrete responses. Variable selection for regression problems, where important control variables are continuous and can take arbitrary numeric values (not necessarily from a fixed number of classes), is significantly harder and less well understood. Our course focuses on the latter problem.

This course (1) describes the most frequently used methods for feature selection in regression-based data-driven modeling, (2) compares these methods with each other, and (3) analyzes their benefits, drawbacks, and applicability to problems of increasing complexity.

The following topics will be covered:

1. Data-driven modeling and regression: Challenges and methods

2. Variable Selection and Variable Importance in regression: (Problems with) Definitions and Importance criteria

3. Principal Component Analysis for assessing variable importance: What do we learn from it?

4. Regression Random Forests for feature selection and regression: benefits and disadvantages for problems with correlated features.

5. Symbolic Regression via Genetic Programming for feature selection and regression: benefits and drawbacks.
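
As a small illustration of why correlated features are a recurring theme in the topics above, the following sketch ranks features by their absolute Pearson correlation with a continuous response. It is not one of the course methods or course material: the toy data, the names `x1`, `x2`, `x3`, and the helper functions `pearson` and `make_toy_data` are all assumptions made up for this example. The point is only that a near-duplicate feature (`x2`) scores almost as high as the true driver (`x1`), so a simple univariate ranking cannot tell a redundant variable from a necessary one.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def make_toy_data(n=500, seed=0):
    """Hypothetical toy problem: the response depends on x1 only."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]        # true driver
    x2 = [v + rng.gauss(0, 0.1) for v in x1]        # near-copy of x1 (redundant)
    x3 = [rng.gauss(0, 1) for _ in range(n)]        # unrelated to the response
    y = [2.0 * v + rng.gauss(0, 0.5) for v in x1]   # continuous response
    return {"x1": x1, "x2": x2, "x3": x3}, y


features, y = make_toy_data()

# Score each feature by |correlation with the response| and print a ranking.
scores = {name: abs(pearson(col, y)) for name, col in features.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: |r| = {score:.3f}")
```

In this toy setting both `x1` and `x2` receive high scores while `x3` scores near zero, even though `x2` adds no information beyond `x1`.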

Our aim is to provide a critical and objective analysis of the feature selection problem for regression, including the complicating factors of noisy, imbalanced data, correlated and coupled variables, and possibly many redundant variables. This is a hands-on course. All methods will be illustrated on toy and real-world examples. If you have an interesting, challenging feature selection problem that could be used as an example during the course, please contact us (at least one week before the course).

Course price:

Academic participant - EUR 250 (includes course material and lunch);

Non-academic participant - EUR 600 (includes course material, lunch, and an optional 30-minute hands-on feedback web session one week after the course).

NB: Participants will benefit most from this hands-on course if they bring a laptop and install the required software before the course date.