You should always care about feature selection. Here are six reasons why.
High-quality features drive powerful machine learning models. In some cases, our main job as data scientists is to arrive at these features. This is where feature engineering plays a central role (see this great book about it). A less obvious but at least as important step is feature selection. In this blog post, I list six reasons why. I look at it from three perspectives: workflow, modeling and production.
Workflow 1: focus on actionable features.
Workflow 2: why exactly do you need access to this data source?
You are lucky if the most interesting data is also the easiest to access. More often than not, you will have to deal with a question of feasibility. That is, how fast will you get access to which part of the data? Such considerations can be frustrating, but sometimes there is no way around them. In these cases, you have to weigh the accessibility of each data source against its importance. I am not suggesting you focus only on the low-hanging fruit. Someone might have left it there for a reason. Still, “action beats perfection” is a valid strategy in such situations.
Workflow 3: fewer features are easier to interpret and implement.
Imagine a model that accounts for hundreds of features. How do you intend to explain it in PowerPoint? Sounds like a Dilbert comic, but this question contains a large grain of truth. It feels frustrating to compromise for a presentation. Still, it is more frustrating to invest a lot of time and work and then miss the mark. Additionally, there might be technical limitations in place. For instance, some web personalisation tools rely on hard-coded boolean rules. In these cases, a large number of features and the resulting complexity prevent adoption.
Modeling 1: redundancy dilutes a model’s quality.
Think about two correlated features that measure the same underlying information. In this case, your model’s quality will suffer in one way or another. For instance, in a regression model the standard errors of the correlated coefficients inflate, so an effect that matters can appear statistically insignificant. As a result, you might miss important insights.
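To make the redundancy effect concrete, here is a minimal sketch on synthetic data. It fits an ordinary least squares regression twice: once with a near-duplicate of the informative feature, and once with an unrelated second feature. All names (`ols_standard_errors`, `x2_redundant`, `x2_independent`) and the data itself are assumptions made up for illustration, not taken from a real project.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Fit OLS with an intercept; return coefficients and their standard errors."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    residuals = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = residuals @ residuals / dof           # estimated noise variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                            # the informative feature
x2_redundant = x1 + 0.05 * rng.normal(size=n)      # near copy of x1 (hypothetical)
x2_independent = rng.normal(size=n)                # unrelated feature (hypothetical)
y = 1.0 * x1 + rng.normal(size=n)                  # only x1 drives the target

_, se_redundant = ols_standard_errors(np.column_stack([x1, x2_redundant]), y)
_, se_independent = ols_standard_errors(np.column_stack([x1, x2_independent]), y)

# Index 1 is the coefficient of x1 (index 0 is the intercept).
print("standard error of x1 with a redundant partner:  ", se_redundant[1])
print("standard error of x1 with an independent partner:", se_independent[1])
```

In this setup, the standard error of the `x1` coefficient should come out much larger next to its near-duplicate than next to the independent feature. The larger it is, the harder it becomes to call the effect significant, even though `x1` is the only feature that drives the target.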
Modeling 2: more features create an increased risk of overfitting.
The easiest way to overfit a model to the training data is to include as many features as possible. As the number of features increases, a model becomes more likely to mistake noise for signal. Do your accuracy scores diverge between the training and test sets? An oversupply of features is a good starting hypothesis.
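The gap between training and test scores can be reproduced with a small experiment. The sketch below uses synthetic data with one informative feature and 40 pure-noise features, then compares a linear model fitted on the informative feature alone with one fitted on all 41. The sample sizes, the helper `fit_and_score` and all variable names are illustrative assumptions.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit OLS with an intercept; return mean squared error on train and test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((y_train - A_train @ beta) ** 2)
    test_mse = np.mean((y_test - A_test @ beta) ** 2)
    return train_mse, test_mse

rng = np.random.default_rng(42)
n_train, n_test, n_noise = 60, 1000, 40            # assumed sizes for illustration
n_total = n_train + n_test

signal = rng.normal(size=(n_total, 1))             # the one informative feature
noise = rng.normal(size=(n_total, n_noise))        # features unrelated to the target
y = 2.0 * signal[:, 0] + rng.normal(size=n_total)

X_few = signal                                     # informative feature only
X_many = np.hstack([signal, noise])                # informative + 40 noise features

train_few, test_few = fit_and_score(
    X_few[:n_train], y[:n_train], X_few[n_train:], y[n_train:]
)
train_many, test_many = fit_and_score(
    X_many[:n_train], y[:n_train], X_many[n_train:], y[n_train:]
)

print(f"1 feature:   train MSE {train_few:.2f}, test MSE {test_few:.2f}")
print(f"41 features: train MSE {train_many:.2f}, test MSE {test_many:.2f}")
```

The model with the noise features fits the training set at least as well as the small one, because extra columns can only lower the training error of a least-squares fit. On held-out data it should do worse, which is the divergence described above.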
Production: every feature is costly.
Simplified models (e.g. one with fewer features) speed up the production process. Of course, you will lose some quality by doing this. Whether that is a substantive problem depends on the use case. For instance, it is important to think about how time-critical a calculation is. Would your customer notice a failing model? Are you sure that the computing resources suffice? In short: what happens if the model fails or takes too long? Depending on the severity, the trade-off between speed and quality is easier to decide on.