In many articles published these days, especially those covering baseball anomalies, we read about "outliers," "extreme observations," and "influential observations." For instance, SBNation claimed that "twenty runs scored in a game is, of course, an outlier" in reference to the Twins game against the Mariners last night. While these terms sound similar, the three actually have different meanings and can be broken down into finer categories.
An outlier is a data point that diverges from the overall pattern in a sample. It is generally considered to be a data point far outside the norm for a variable or population. The value is so different that it may reflect an entirely different process or population. Extreme observations encompass outliers, but also data points called fringeliers, which are scores hovering, as the name suggests, around the fringes of a distribution.
For instance, a fringelier usually lies around plus or minus 3 standard deviations from the mean of a normal distribution, while an outlier might lie around plus or minus 4.5 to 6 standard deviations from the mean. Extreme observations, whether outliers or fringeliers, can be influential observations. In general, an influential point is any point that has a large effect on the slope of a regression line fit to the data. Influential points are generally extreme values, but not always.
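The standard-deviation bands above translate directly into a screening rule. The sketch below uses hypothetical cutoffs of 3 and 4.5 standard deviations (the rough bands described in this paragraph) and an invented runs-scored sample; the function name and data are ours, not from any particular library.

```python
import numpy as np

def classify_extremes(x, fringe=3.0, outlier=4.5):
    """Split observations into fringeliers (|z| between the fringe and
    outlier cutoffs) and outliers (|z| beyond the outlier cutoff),
    using the sample mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return {
        "fringeliers": x[(np.abs(z) >= fringe) & (np.abs(z) < outlier)],
        "outliers": x[np.abs(z) >= outlier],
    }

# Hypothetical runs-scored sample: 200 typical games plus one 20-run game.
runs = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5] * 20 + [20]
result = classify_extremes(runs)
print(result["outliers"])  # the 20-run game is flagged
```

Note that with small samples a z-score screen is unreliable (the extreme point inflates the standard deviation used to judge it), which is one reason to pair it with visual inspection.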
A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential.
To test the influence of an outlier or fringelier, one can compute the regression equation with and without the extreme observation. These influential points can seriously bias estimates that may be of substantive interest, such as means, standard deviations, and the like. With biased estimates, one is more likely to draw erroneous conclusions.
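The with-and-without comparison can be done in a few lines. This is a minimal sketch with made-up data: ten points near the line y = 2x + 1, plus one high-leverage extreme point far below that pattern.

```python
import numpy as np

# Ten points scattered tightly around y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.array(
    [0.1, -0.2, 0.0, 0.2, -0.1, 0.1, 0.0, -0.1, 0.2, 0.0]
)

# Append one extreme, high-leverage observation far from the pattern.
x_full = np.append(x, 20.0)
y_full = np.append(y, 5.0)

# Fit the regression line with and without the extreme point.
slope_without, _ = np.polyfit(x, y, 1)
slope_with, _ = np.polyfit(x_full, y_full, 1)
print(f"slope without: {slope_without:.2f}, with: {slope_with:.2f}")
```

A single point dragging the slope from roughly 2 down toward 0 is exactly the kind of influence the with/without comparison is meant to expose.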
We can identify outliers and extreme observations by plotting the data, by understanding how the data were gathered, or by fitting a regression to the data and checking whether the line is pulled toward particular points. Screening data around plus or minus 3 standard deviations from the mean, combined with a visual inspection of the data, is a good place to start. Further into the analysis, one can look at standardized residuals to spot outliers.
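The standardized-residual check mentioned above can be sketched as follows: fit a line, divide each residual by the residuals' standard deviation, and flag anything beyond 3 in absolute value. The data here are hypothetical, with one deliberately miskeyed point.

```python
import numpy as np

# Hypothetical data on a clean trend, with one data-entry error at index 12.
x = np.arange(20, dtype=float)
y = 0.5 * x + 2.0
y[12] += 9.0  # the miskeyed observation

# Fit a straight line and standardize the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
standardized = residuals / residuals.std()

# Flag points whose standardized residual exceeds 3 in absolute value.
flagged = np.where(np.abs(standardized) > 3)[0]
print(flagged)  # the bad point at index 12
```

In practice one would use studentized residuals (which account for each point's leverage) from a package such as statsmodels, but the simple version above captures the idea.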
As much as one would like to, it really is not acceptable to drop an observation just because it is an outlier or extreme observation. Outliers can be legitimate observations, and they are sometimes the most interesting ones. It is important to investigate the nature of the outlier before deciding what to do with it, and to remember that all variance is caused by something.
If it is obvious that the outlier is due to incorrectly entered or measured data, then you should drop it. If the outlier does not change the results but does affect assumptions, you can also drop it. More commonly, though, the outlier affects both the results and the assumptions. In that situation, it is not legitimate to simply drop the outlier. You can run the analysis both with and without the extreme observation, as long as you keep track of how the results change. If the outlier single-handedly creates a significant association, you should drop the outlier and not report any significance from the analysis.
In those cases where one should not drop the outlier, there are a few courses of action. One option is to try a transformation. Square root and log transformations both pull in high numbers. This can make assumptions work better if the outlier is a dependent variable, and it can reduce the impact of a single point if the outlier is an independent variable. Another option is to try a different model. This should be done with caution, but it may be that a non-linear model fits better.
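To see how these transformations "pull in" a high value, one crude yardstick (our own, for illustration) is the gap between the largest value and the next largest, scaled by the spread of the remaining points. On a hypothetical skewed sample, that ratio shrinks under the square root and shrinks further under the log.

```python
import numpy as np

def gap_ratio(v):
    """Distance from the largest value to the second largest,
    scaled by the spread of the remaining points."""
    s = np.sort(np.asarray(v, dtype=float))
    return (s[-1] - s[-2]) / (s[-2] - s[0])

# Hypothetical sample with one high value.
values = np.array([3.0, 4.0, 5.0, 4.0, 20.0])

print(f"raw:  {gap_ratio(values):.1f}")           # ~7.5
print(f"sqrt: {gap_ratio(np.sqrt(values)):.1f}")  # ~4.4
print(f"log:  {gap_ratio(np.log(values)):.1f}")   # ~2.7
```

The extreme point ends up far less separated from the bulk of the data on the transformed scales, which is why the transformed model is less at its mercy.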
In our work here, we usually keep outliers in our analyses unless we know for a fact that they were introduced or induced by error. With the large data sets we generally work with, the influence of any single outlier is smoothed out, which makes our conclusions more robust.
The SaberSmart Team