Overfitting is usually explained via the “bias-variance tradeoff”. Yet framing it as an issue of “data vs. information” explains it better.
After an exhausting stretch of data cleaning, tedious questioning of suspicious data points and other non-fun tasks, I want my model to work. This is a dangerous moment. Wanting a model to work makes me vulnerable to a chief concern of data scientists: overfitting. It is the technical term for “the model looks better than it is”. The model learned to fulfill your expectations on the data you provided, yet it will fail when you apply it to unseen data.
Many people have worked hard to create tools that mitigate this issue. The specifics depend on the use case. No matter the details, we as data scientists need time and/or resources to put them in place. To get them, we need clear and concise justifications. The classical line of reasoning is the so-called “bias-variance tradeoff”. It is important to understand these concepts, and I will provide a brief summary in the next section. But I also think there is a better way: referring to the difference between data and information.
Your model has high bias if it misses much of reality’s mechanics. Given the data you have, your model could be a better representation of what is actually going on. Your model has high variance if it is oversensitive to random noise in your data. The tradeoff emerges for two reasons. First, a straightforward way to reduce bias is to make the model more complex. For instance, you can add interaction terms to your simplistic regression model. Second, an easy way to reduce variance is to force the model to “think” in simpler terms. In other words, you prevent it from learning (too) complicated relationships. For instance, you prevent high variance by pruning a decision tree. In sum, the tradeoff implies that you need to decide on the appropriate level of complexity.
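The pruning example can be made concrete with a minimal sketch on made-up data, assuming scikit-learn’s `DecisionTreeRegressor`: a depth-limited tree stays simple (more bias, less variance), while an unpruned tree chases every wiggle of the noise (less bias, more variance).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a smooth signal plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

# A pruned (depth-limited) tree: simpler model, higher bias, lower variance
pruned = DecisionTreeRegressor(max_depth=2).fit(X, y)
# An unpruned tree: complex model, lower bias, higher variance
unpruned = DecisionTreeRegressor().fit(X, y)

# The unpruned tree all but memorizes the training data, noise included
print(f"pruned R^2 on training data:   {pruned.score(X, y):.2f}")
print(f"unpruned R^2 on training data: {unpruned.score(X, y):.2f}")
```

The near-perfect training score of the unpruned tree is not a success story: it reflects memorized noise, which is exactly what held-out data would expose.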
In technical terms, underfitted models have high bias, whereas overfitted models have high variance. Every data scientist needs to understand this explanation. Yet both “bias” and “variance” are terms that are unfamiliar to a non-technical audience, or even misleading. For instance, “more variance” sounds as if the model accounted for all the details of a complex business environment. That is why I find an explanation in terms of data vs. information more useful.
Data vs. Information
Data is what emerges from measurable sources. Information is the grain of truth embedded within it. A score in an IQ test is data; the information about actual intelligence may or may not relate to it. Information should drive our decisions, not data in its raw format. That is the whole purpose of machine learning. More complex approaches promise to uncover more complex information in the data. Yet no matter the data, there is only a finite amount of information contained in it. Humans (and the machine learning algorithms they apply) specialize in pattern detection. Sometimes we overdo it. If you mistake data for information, you have overfitted your model: it sees patterns where there are none. For me, this explanation of overfitting is easier to communicate than the bias-variance tradeoff.
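Seeing patterns where there are none can be demonstrated directly: feed a flexible model data that contains zero information and it will still “learn” something. A small sketch with fabricated noise, assuming scikit-learn (any sufficiently flexible learner would do; here an unpruned decision tree):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Pure noise: the features carry no information about the target at all
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # perfect: every "pattern" memorized
test_r2 = model.score(X_test, y_test)     # poor: nothing generalizes
print(f"train R^2: {train_r2:.2f}, test R^2: {test_r2:.2f}")
```

The model scores perfectly on the data it has seen and fails on unseen data, which is overfitting in its purest form: all data, no information.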
To be clear, there is no single best way to communicate data science topics. It depends on the audience (technical vs. muggles), the goal of your talk (presentation vs. lunch) and many other factors. Yet there is one main advantage of the “data vs. information” approach: it enables a broader audience to appreciate your data science work.