Another Perspective on Overfitting

Overfitting is usually explained via the “bias-variance tradeoff”. Yet, framing it as an issue of “data vs. information” is often the more helpful explanation.


Intro

After an exhausting period of data cleaning, tedious questioning of suspicious data points and other non-fun tasks, I want my model to work. This moment is a dangerous one. Wanting a model to work makes me vulnerable to one of the main pitfalls for data scientists: overfitting. It is the technical term for “the model looks better than it is”. The model has learned to fulfill your expectations on the data you provided. Yet, it will fail when you apply it to unseen data.

Many people have worked hard to create tools that mitigate this issue. The specifics depend on the use case. No matter the details, we as data scientists need time and/or resources to put them in place. To get them, we need clear and concise justifications. The classical line of reasoning is the so-called “bias-variance tradeoff”. It is important to understand these concepts, and I will provide a brief summary in the next section. I also think there is a better way: referring to the difference between data and information.

“The Bias-Variance Tradeoff”

Your model has a high bias if it misses a lot of reality’s mechanics. Given the data you have, your model could be a better representation of what is actually going on. Your model has a high variance if it is oversensitive to random noise in your data. The tradeoff emerges for two reasons. First, a straightforward way to reduce bias is to make the model more complex. For instance, you can add interaction terms to your simplistic regression model. Second, an easy way to reduce variance is to force the model to “think” simpler. In other words, you prevent it from learning (too) complicated relationships. For instance, you prevent high variance by pruning a decision tree. In sum, the tradeoff implies that you need to decide on the appropriate level of complexity.
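If you want to see the tradeoff in numbers, here is a minimal sketch (the data, the model and the depth values are made up for illustration): decision trees of increasing depth are fit to noisy data. A very shallow tree underfits (high bias: both errors are high), while a fully grown tree overfits (high variance: the training error is close to zero while the test error rises again).

```python
# Illustrative sketch only: dataset, model and depth values are arbitrary choices.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)  # signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, 10, None):  # from very simple to fully grown
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    print(f"max_depth={depth}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```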

In technical terms, underfitted models have a high bias, whereas overfitted models have a high variance. Every data scientist needs to understand this explanation. Yet, both “bias” and “variance” are terms that are unfamiliar to a non-technical audience or even misleading. For instance, “more variance” sounds like you would account for all the details of a complex business environment. That is why I find an explanation in terms of data vs. information more useful.

Data vs. Information

Data is what emerges from measurable sources. Information is the embedded grain of truth within it. A score in an IQ test is data; the information about actual intelligence may or may not relate to it. Information should drive our decisions, not data in its raw format. That is the whole purpose of machine learning. More complex approaches promise to uncover more complex information in the data. Yet no matter the data, there is only a finite amount of information contained in it. Humans (and the machine learning algorithms they apply) specialize in pattern detection. Sometimes we overdo it. If you mistake data for information, you have overfitted your model. It sees patterns where there are none. For me, this explanation of overfitting is easier to communicate than the bias-variance tradeoff.
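A small, hypothetical experiment makes the point. The data below is pure noise, so it contains no information at all. A flexible model still “finds” patterns in it: the training score is close to perfect, while the cross-validated score stays at chance level.

```python
# Illustrative only: the data is pure noise, so any pattern the model finds is overfitting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))        # 200 rows of random "data"
y = rng.integers(0, 2, size=200)      # labels unrelated to X

model = RandomForestClassifier(n_estimators=200, random_state=0)
print("training accuracy:       ", model.fit(X, y).score(X, y))           # close to 1.0
print("cross-validated accuracy:", cross_val_score(model, X, y).mean())   # around 0.5
```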

Outro

To be clear, there is no single best way to communicate data science topics. It depends on the audience (technical vs. muggles), the goal of your talk (presentation vs. lunch) and many other factors. Yet, there is one main advantage of the “data vs. information” approach: it enables a broader audience to appreciate your data science work.

Why to pay for Data Science Trainings

Most data science resources are online and many of them are free. Why pay?


Intro

I left university with a Ph.D. in sociology. After spending a little more than a year in market research, I started my career as a data scientist. Online courses were invaluable for me, both for clarifying my interests and for developing the necessary set of skills. Most of my experience comes from Udacity, DataCamp and Coursera. “Data Scientist” found its way onto my business card, but I am always looking for new things to learn. My next step is to start the Artificial Intelligence Nanodegree at Udacity. But is such a degree worth the 599€? Every data scientist knows how far you can get with KDnuggets and Stack Overflow. Here is why I still prefer to pay for courses and material.

Structure & Material

Data Science is a multifaceted field. It feels like someone posts a list of “things you need to know as a data scientist” every other day. Especially when you are new to the field, these lists are intimidating. A course removes the uncertainty of deciding what is most important. It also helps you to acquire a skill in a consistent and structured manner. That is not impossible to do on your own, but way harder. Courses provide both the “what” and the “how” of data science skills for you.

I also found paid-for material to be of higher quality. Full-time educators create, all else being equal, higher-quality content. When education professionals work with experts, the results can be superior. Paid material also tends to have higher relevance, that is, you can actually use it in the industry. An example of this is the deep learning specialization by Andrew Ng on Coursera. Udacity’s Nanodegrees are great for the same reasons. Paid-for courses also tend to provide a greater variety of material.

Feedback & Support

Feedback is the most important concept in learning, no matter the subject. The lack thereof is a problem, because it is very easy to overestimate one’s own understanding. That is why some of the unpaid material feels like reading a book. It might be a great read, but if you do not find a way to apply it, you are not realizing its full potential.
 
That is why notebooks are extremely valuable. The same is true for quizzes. Yet, the best form of feedback is the opportunity to hand in projects. If teaching assistants grade and comment on them, as is the case in Udacity’s Nanodegree programs, your progress accelerates. That is something that is very hard to get without paying for a program.

Certification

There are two main reasons why an official certification is worth the money. First, you can add it to your CV and use it as a signal of both competence and willingness to learn. Beyond getting a job, it also helps you get access to the more interesting projects within your company. Second, paying for something increases your commitment. That is especially the case for monthly fees, but also true for high one-time fees. As someone who does a lot of business trips and spends a lot of time in hotels, I am often tempted to shut down my brain after a workday. Knowing that I am paying for a course right now helps me overcome the initial inertia.

Outro

If you work at a place that supports online trainings and pays at least part of the bill, there is no reason not to engage in paid courses. Define the gap you want to narrow, look for the best program and commit to it. If you have to pay for the course yourself, be honest: are you able to structure your learning process and pull through on your own? If not, the money spent on paid-for material can be worth every cent.

6 Reasons why Feature Selection is important

You should always care about feature selection. Here are six reasons why.


Intro

High-quality features drive powerful machine learning models. In some cases, our main job as data scientists is to arrive at these features. This is where feature engineering plays a central role (see this great book about it). A less obvious but at least as important step is feature selection. In this blog post, I list six reasons why, looking at it from three perspectives: workflow, modeling and production.

Workflow 1: focus on actionable features.

Partners from the business side especially stress this argument. Stakeholders focus on actionable results for good reasons. Yet, they also need to know about factors they cannot influence. Ignoring these external factors might backfire. For instance, you cannot change the seasons, but your promotions have to account for them. It is obvious in this case, but can be much harder to spot in others.

Workflow 2: why exactly do you need access to this data source?

You have to be lucky for the most interesting data to also be the data with the easiest access. More often than not, you will have to deal with a question of feasibility. That is, how fast will you get access to which part of the data? Such considerations can be frustrating, but sometimes there is no way around them. In these cases, you have to weigh the accessibility and importance of each data source. I am not calling for focusing only on the low-hanging fruit. Someone might have left it there for a reason. Instead, “action beats perfection” is a valid strategy in such situations.

Workflow 3: fewer features are easier to interpret and implement.

Imagine a model that accounts for hundreds of features. How do you intend to explain it in PowerPoint? It sounds like a Dilbert comic, but this question contains a large grain of truth. It feels frustrating to compromise for a presentation. Still, it is more frustrating to waste a lot of time and work only to miss the mark. Additionally, there might be technical limitations in place. For instance, some web personalization tools rely on hard-coded boolean rules. Many features and the resulting complexity prevent adoption in these cases.
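As a rough sketch of this point (the feature names and the data are invented for illustration), a shallow decision tree built on two selected features reduces to a handful of boolean rules that you could hard-code into such a tool:

```python
# Hypothetical example: feature names and data are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "visits_last_30d": rng.poisson(3, 500),
    "avg_basket_value": rng.gamma(2.0, 20.0, 500),
})
y = ((X["visits_last_30d"] > 4) & (X["avg_basket_value"] > 40)).astype(int)

# A depth-2 tree on two selected features yields rules simple enough to hard-code.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```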

Modeling 1: redundancy dilutes a model’s quality.

Think about two correlated features that measure the same underlying information. In this case, your model’s quality will suffer in one way or another. For instance, a regression model will spread the effect across both features and output low levels of significance for each. As a result, you might miss important insights.
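A simulated example (all values are illustrative) shows the effect: x1 and x2 carry the same information, so the regression can no longer tell them apart and reports inflated standard errors and weak p-values for both.

```python
# Illustrative simulation: two nearly identical features dilute each other's significance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2.0 * x1 + rng.normal(size=n)          # only the shared signal matters

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.summary())  # note the large standard errors and weak p-values on x1 and x2
```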

Modeling 2: more features create an increased risk of overfitting.

The easiest way to overfit a model to the training data is to include as many features as possible. As the number of features increases, the model is more likely to confuse noise with signal. Do your accuracy scores diverge between training and test set? An oversupply of features is a good starting hypothesis.
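Here is a quick, illustrative sketch of this effect (the numbers and the classifier choice are arbitrary): the same kind of model is trained with an increasing number of uninformative features, and the gap between training and test accuracy widens.

```python
# Illustrative sketch: more uninformative features widen the train/test gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

for n_noise in (0, 50, 500):
    X, y = make_classification(n_samples=300, n_features=10 + n_noise,
                               n_informative=10, n_redundant=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"{n_noise:>3} noise features: "
          f"train {clf.score(X_tr, y_tr):.2f}, test {clf.score(X_te, y_te):.2f}")
```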

Production: every feature is costly.

Simplified models (e.g. ones with fewer features) speed up the production process. Of course, you will lose some quality by doing this. Whether that is a substantial problem depends on the use case. For instance, it is important to think about how time-critical a calculation is. Would your customer notice a failing model? Are you sure that the computing resources suffice? In short: what happens if the model fails or takes too long? Depending on the severity, the trade-off between speed and quality becomes easier to decide on.

Outro

This post focuses on the “why” and not the “how” of feature selection. From a technical point of view, there are several ways to reduce the number of features used. One common approach is to use feature extraction techniques like PCA. This approach focuses on keeping most of the underlying information intact; the downside is that interpretability decreases. Pure feature selection techniques (e.g. LASSO) decide on a subset of features. This keeps the interpretability intact, but throws away some of the underlying information. No matter which choice is best in your case: you should always care about feature selection!
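For a concrete feel of the two routes, here is a hedged sketch using scikit-learn on a standard dataset; the parameter values (the variance threshold and alpha) are illustrative choices, not recommendations.

```python
# Sketch of feature extraction (PCA) vs. feature selection (LASSO); parameters are illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature extraction: PCA keeps most of the variance, but every component is a
# mix of all original features, which costs interpretability.
pca = PCA(n_components=0.95).fit(X_scaled)
print("PCA components needed for 95% of the variance:", pca.n_components_)

# Feature selection: LASSO drives some coefficients to exactly zero, so the
# surviving features keep their original meaning (how many survive depends on alpha).
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
print("features kept by LASSO:", int(np.sum(lasso.coef_ != 0)), "out of", X.shape[1])
```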

Resources