Why is extrapolation dangerous in statistics

I remember sitting in stats courses as an undergrad and hearing why extrapolation was a bad idea. There are also a variety of sources online that comment on this, and it gets a mention here as well.

Can anyone help me understand why extrapolation is a bad idea? If it is, how is it that forecasting techniques aren't statistically invalid? A regression model is often used for extrapolation, i.e., for predicting outside the range of the observed data. The danger associated with extrapolation is illustrated in the following figure.

Using the data points Cueball (the man with the stick) has, he has extrapolated that the woman will have "four dozen" husbands by late next month, and used this extrapolation to conclude that he should buy the wedding cake in bulk.

Edit 3: For those of you who say "he doesn't have enough data points", here's another xkcd comic: Here, the usage of the word "sustainable" over time is shown on a semi-log plot, and extrapolating the data points yields an unreasonable estimate of how often the word "sustainable" will occur in the future. Edit 2: For those of you who say "you need all past data points too", yet another xkcd comic: Here, we have all past data points, but we still fail to accurately predict the resolution of Google Earth.

Note that this is a semi-log graph too. If you extrapolate without other supporting evidence, you are also violating the maxim that correlation does not imply causation, another great sin in the world of statistics.

If you do extrapolate X from Y, however, you must make sure that you can predict X from Y alone, accurately enough to satisfy your requirements. Almost always, there are multiple factors that impact X. I would like to share a link to another answer that explains it in the words of Nassim Nicholas Taleb. The quote is attributed to many people in some form.

In the following I restrict "extrapolation" to "prediction outside the known range", and to a one-dimensional setting: extrapolation from a known past to an unknown future. So what is wrong with extrapolation? First, it is not easy to model the past.

Second, it is hard to know whether a model of the past can be used for the future. Behind both assertions dwell deep questions about causality or ergodicity, sufficiency of explanatory variables, etc. What is wrong is that it is difficult to choose a single extrapolation scheme that works well in different contexts without a lot of extra information.

This generic mismatch is clearly illustrated in the Anscombe quartet dataset shown below. The same line regresses four sets of points, with the same standard statistics. However, the underlying models are quite different: the first one is quite standard. The second reflects a parametric model error (a second- or third-degree polynomial could be better suited), and the third shows a perfect fit except for one value (an outlier?).
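The quartet's identical summary statistics are easy to verify numerically. A minimal sketch with NumPy, using Anscombe's published data values:

```python
import numpy as np

# Anscombe's quartet: four datasets with (nearly) identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

stats = []
for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
    stats.append((slope, intercept, r))
    print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
# All four print the same line (y ~ 3.00 + 0.50 x) and the same r ~ 0.816,
# even though the scatterplots look nothing alike.
```

Extrapolating any of these four fits far beyond the data would therefore be justified by identical statistics, while being wildly wrong in at least three of the four cases.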

However, forecasting can be rectified to some extent. Adding to other answers, a couple of ingredients can help practical extrapolation:

Recently, I have been involved in a project for extrapolating values for the communication of simulation subsystems in a real-time environment. The dogma in this domain was that extrapolation may cause instability. We actually realized that combining the two above ingredients was very efficient, without noticeable instability (though without a formal proof yet): CHOPtrey: contextual online polynomial extrapolation for enhanced multi-core co-simulation of complex systems, Simulation. The extrapolation worked with simple polynomials, with a very low computational burden, most of the operations being computed beforehand and stored in look-up tables.
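As a rough illustration of the idea (a toy sketch, not the actual CHOPtrey scheme), a low-order polynomial fitted to the last few samples can extrapolate a smooth signal one step ahead very cheaply:

```python
import numpy as np

def poly_extrapolate(t, y, t_next, degree=2):
    """Predict the next sample by fitting a low-order polynomial to the
    most recent points. A toy sketch of short-range polynomial extrapolation;
    the window size (degree + 2 points) is an arbitrary choice here."""
    window = degree + 2
    coeffs = np.polyfit(t[-window:], y[-window:], degree)
    return np.polyval(coeffs, t_next)

# A smooth, slowly varying signal sampled at fixed steps.
t = np.arange(0.0, 1.0, 0.1)
y = np.sin(t)

pred = poly_extrapolate(t, y, 1.0)
print(pred, np.sin(1.0))  # close, because the signal is smooth over the window
```

The point is that one step beyond a short window of smooth data, a simple polynomial does fine; the same scheme pushed many steps ahead, or applied to a signal with abrupt changes of context, degrades quickly.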

Finally, as extrapolation suggests funny drawings, the following is the backward effect of linear regression:

Although the fit of a model might be "good", extrapolation beyond the range of the data must be treated skeptically. The reason is that in many cases extrapolation unfortunately and unavoidably relies on untestable assumptions about the behaviour of the data beyond their observed support. When extrapolating, one must make two judgement calls: first, from a quantitative perspective, how valid is the model outside the range of the data?

Because both questions entail a certain degree of ambiguity, extrapolation is considered an ambiguous technique too. If you have reasons to accept that these assumptions hold, then extrapolation is usually a valid inferential procedure.

An additional caveat is that many non-parametric estimation techniques do not natively permit extrapolation. This problem is particularly noticeable in the case of spline smoothing, where beyond the data there are no more knots to anchor the fitted spline. Let me stress that extrapolation is far from evil. For example, numerical methods widely used in statistics (such as Aitken's delta-squared process and Richardson extrapolation) are essentially extrapolation schemes, based on the idea that the underlying behaviour of the function analysed for the observed data remains stable across the function's support.
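Aitken's delta-squared process, for instance, fits in a few lines. Applied to the slowly converging Leibniz series for pi/4 (a standard illustration), one pass of the transform gains several digits:

```python
import math

def aitken(s):
    """Aitken's delta-squared acceleration of a convergent sequence s:
    s'_n = s_n - (s_{n+1} - s_n)^2 / (s_{n+2} - 2 s_{n+1} + s_n)."""
    return [
        s[n] - (s[n + 1] - s[n]) ** 2 / (s[n + 2] - 2 * s[n + 1] + s[n])
        for n in range(len(s) - 2)
    ]

# Partial sums of the Leibniz series 1 - 1/3 + 1/5 - ..., converging to pi/4.
partial = []
total = 0.0
for k in range(10):
    total += (-1) ** k / (2 * k + 1)
    partial.append(total)

accelerated = aitken(partial)
print(abs(partial[-1] - math.pi / 4))      # raw error: a few percent
print(abs(accelerated[-1] - math.pi / 4))  # accelerated error: much smaller
```

This works precisely because the tail of the series behaves like its observed terms, which is the same stability assumption that underwrites any sound extrapolation.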

Contrary to other answers, I'd say that there is nothing wrong with extrapolation as long as it is not used in a mindless way. First, notice what extrapolation is: In fact, extrapolation, prediction and forecasting are closely related. In statistics we often make predictions and forecasts. This is also what the link you refer to says: Many extrapolation methods are used for making predictions; moreover, some simple methods often work pretty well with small samples, so they can be preferred to complicated ones.

The problem, as noted in other answers, arises when you use an extrapolation method improperly. For example, many studies show that the age of sexual initiation has decreased over time in western countries. Take a look at the plot below of age at first intercourse in the US.

If we blindly used linear regression to predict age at first intercourse, we would predict it to go below zero at some point, with first marriage and first birth accordingly happening at some time after death. However, if you needed a one-year-ahead forecast, then I'd guess that linear regression would lead to pretty accurate short-term predictions for the trend.
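The contrast is easy to reproduce with invented numbers (a hypothetical series, not the real survey data): the same fitted line gives a sensible one-year-ahead forecast and an absurd far-future one.

```python
import numpy as np

# Hypothetical data: an age drifting down ~0.05 years per calendar year,
# plus noise. These numbers are made up for illustration only.
years = np.arange(1960, 2000)
rng = np.random.default_rng(0)
age = 19.0 - 0.05 * (years - 1960) + rng.normal(0, 0.1, years.size)

slope, intercept = np.polyfit(years, age, 1)

one_year_ahead = slope * 2000 + intercept  # short-range: plausible (~17)
far_future = slope * 2400 + intercept      # long-range: a negative age!
print(one_year_ahead, far_future)
```

Nothing about the fit itself warns you of the difference; the absurdity only appears when the prediction is checked against what the quantity can physically be.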

Another great example comes from a completely different domain, since it is about "extrapolating" as done by Microsoft Excel, shown below (I don't know if this has been fixed yet). I don't know the author of this image; it comes from Giphy.

All models are wrong, and extrapolation is also wrong, since it won't enable you to make perfectly precise predictions. How accurate the predictions are depends on the quality of the data you have, on using methods adequate for your problem, on the assumptions you made while defining your model, and on many other factors.

But this doesn't mean that we can't use such methods. We can, but we need to remember their limitations and assess their quality for a given problem. I quite like the example by Nassim Taleb, which was an adaptation of an earlier example by Bertrand Russell:

Regression analysis is widely used for prediction and forecasting.

Regression analysis is also used to understand which among the independent variables is related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables.

However, this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation. Prediction within the range of values in the data set used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation. Performing extrapolation relies strongly on the regression assumptions. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.

It is generally advised that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data. A properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data, but it can only do so within the range of values of the independent variables actually available.
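This widening can be seen directly in the textbook prediction-interval formula for simple linear regression. A sketch (using the large-sample normal quantile 1.96 as an approximation in place of the exact t quantile):

```python
import numpy as np

# Simulated data from a known line, for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, x.size)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
s = np.sqrt(resid @ resid / (x.size - 2))   # residual standard error
sxx = np.sum((x - x.mean()) ** 2)

def pi_half_width(x0, z=1.96):
    """Approximate 95% prediction-interval half-width at x0:
    z * s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)."""
    return z * s * np.sqrt(1 + 1 / x.size + (x0 - x.mean()) ** 2 / sxx)

# Inside the data (x ~ 0..10) the interval is narrow;
# it grows without bound as x0 moves away from the observed range.
print(pi_half_width(5.0), pi_half_width(15.0), pi_half_width(30.0))
```

The quadratic term (x0 - xbar)^2 / Sxx is what makes the interval balloon: the penalty for leaving the data grows with the square of the distance from the sample mean.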

This means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship. Best-practice advice here is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that all available knowledge should be deployed in constructing a regression model.

If this knowledge includes the fact that the dependent variable cannot go outside a certain range of values, this can be made use of in selecting the model — even if the observed data set has no values particularly near such bounds.
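For instance (a hypothetical sketch), if the dependent variable is a proportion known to lie in (0, 1), fitting on the logit scale builds that bound into the model, whereas a straight line on the raw scale happily extrapolates past it:

```python
import numpy as np

# Hypothetical proportions rising toward saturation (made-up data).
x = np.arange(10, dtype=float)
p = 1 / (1 + np.exp(-(x - 4)))  # values strictly inside (0, 1)

# Model 1: fit a line on the logit scale, then map back through the
# inverse logit, so every prediction respects the (0, 1) bounds.
logit = np.log(p / (1 - p))
b, a = np.polyfit(x, logit, 1)
pred_bounded = 1 / (1 + np.exp(-(a + b * 20)))

# Model 2: fit a straight line on the raw scale; extrapolated to x = 20
# it overshoots the upper bound of 1, which is impossible for a proportion.
c, d = np.polyfit(x, p, 1)
pred_linear = d + c * 20

print(pred_bounded, pred_linear)
```

Both models describe the observed range acceptably; only the functional form chosen with the bound in mind extrapolates sensibly.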

The implications of this step of choosing an appropriate functional form for the regression can be great when extrapolation is considered.

Here are the required conditions for the regression model: The importance of data distribution in linear regression inference: a good rule of thumb when using the linear regression method is to look at the scatter plot of the data. This graph is a visual example of why it is important that the data have a linear relationship. Each of these four data sets has the same linear regression line and therefore the same correlation, 0.816.

This number may at first seem like a strong correlation, but in reality the four data distributions are very different: the same predictions that might be true for the first data set would likely not be true for the second, even though the regression method would lead you to believe that they were more or less the same. Looking at panels 2, 3, and 4, you can see that a straight line is probably not the best way to represent these three data sets.

A graph of averages and the least-squares regression line are both good ways to summarize the data in a scatterplot. Linear (straight-line) relationships between two quantitative variables are very common in statistics, and such a relationship can be summarized by drawing a line through the scatterplot. The line is a model that can be used to make predictions, whether by interpolation or extrapolation. In most cases, a line will not pass through all points in the data.

A good regression line makes the distances from the points to the line as small as possible. The points on a graph of averages do not usually line up in a straight line, making it different from the least-squares regression line. Least-squares regression line: random data points and their linear regression.
