Overview
Anyone who has had to run an analysis has probably wrestled with the sticky question of what to do with outliers: sticky because there don't seem to be any firm guidelines for whether or how they should be trimmed from a dataset. The short answer to the question of whether you should remove outliers from the data? It depends. What can make things even more confusing is that handling outliers is usually something people have at least a degree of familiarity with, either through coursework or advice from advisers. In published studies, however, authors report testing the assumptions of the statistical procedure(s) they used, including checking for outliers, only about 8% of the time.
How are outliers identified?
There is a lot of variability in what an outlier "is" depending on your industry or field of study. Academics in the social sciences tend to look at box-and-whisker plots and use the 1.5x interquartile range (IQR) rule. This can be messy to think about, so here's an example:
Suppose you had the following numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The median is the point where half the data fall above and half fall below. In this case, it lands between 5 and 6, so the median is 5.5.
Quartile 1: 3
Quartile 3: 8
IQR: Q3 - Q1 = 8 - 3 = 5
Now that the IQR is defined, multiply it by 1.5 and tack that amount onto either side of Q1 and Q3:
Lower outlier boundary: Q1 - (1.5 x IQR) = 3 - (1.5 x 5) = 3 - 7.5 = -4.5
Upper outlier boundary: Q3 + (1.5 x IQR) = 8 + (1.5 x 5) = 8 + 7.5 = 15.5
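In code, the same fence calculation might look like the following minimal sketch (Python and NumPy assumed; the quartiles are computed as medians of the lower and upper halves to match the hand calculation above, since library quantile functions use interpolation conventions that can give slightly different values):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Split the (even-length) data in half and take the median of each half,
# which reproduces the hand-computed quartiles of 3 and 8.
lower_half, upper_half = data[: len(data) // 2], data[len(data) // 2:]
q1, q3 = np.median(lower_half), np.median(upper_half)
iqr = q3 - q1                      # 8 - 3 = 5

lower_fence = q1 - 1.5 * iqr       # 3 - 7.5 = -4.5
upper_fence = q3 + 1.5 * iqr       # 8 + 7.5 = 15.5

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(lower_fence, upper_fence, outliers)   # -4.5 15.5 [] (nothing flagged here)
```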
Thus, any number less than -4.5 or greater than 15.5 in this data set could get labeled an "outlier." Other popular measures, such as Mahalanobis distance or Cook's D, assess an observation's "leverage" or "influence" and flag points that affect your model or estimates too much. Some academics instead use a Z = 3 rule as a first check: any value more than three standard deviations from the mean gets the axe.

Outside of academia, some companies prefer a Z = 6 cutoff, so only values more than six standard deviations from the mean are flagged. This is partly because folks in industry see extreme values differently: a marketing campaign that gets an incredible number of responses, clicks, or purchases is tremendously valuable, and these "rock-star" campaigns may be exactly what they're looking for, so keeping them in the analysis becomes an important part of understanding top performance. In academia, that's usually not the case. With that in mind, let's move on to some reasons why you might want to remove outliers from a dataset.
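If you lean toward the z-score approach, the rule is essentially a one-liner: flag anything more than some number of standard deviations from the mean. Here is a rough sketch with made-up numbers (whether the cutoff is 3 or 6 is exactly the judgment call described above):

```python
import numpy as np

def zscore_outliers(values, cutoff=3.0):
    """Return the values lying more than `cutoff` sample standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return values[np.abs(z) > cutoff]

# Hypothetical data: twenty unremarkable values plus one extreme one.
data = [52, 48, 50, 55, 47, 51, 49, 53, 46, 54,
        50, 52, 48, 51, 49, 50, 47, 53, 45, 55, 500]

print(zscore_outliers(data, cutoff=3.0))   # flags the 500 (its z-score is roughly 4.4)
print(zscore_outliers(data, cutoff=6.0))   # flags nothing under the stricter rule
```

Notice that under the stricter Z = 6 rule even the obviously weird value survives, which is exactly the behavior some industry teams want.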
The pros: why you might want to remove outliers
1. The number is clearly an unintentional error. Suppose you're conducting a survey and you ask people to report their hourly wage (around $15.00). If someone misreads the question and enters their yearly income ($30,000) instead, that single entry will cause all sorts of problems in your analysis: it will drastically inflate the mean and standard errors, and it could be highly influential relative to the rest of your data. Either fix it or drop it from your analysis.
2. The number is an intentional error. There are times when participants purposefully report incorrect data to experimenters. A participant may make a conscious effort to sabotage the research or may be acting from other motives. Social desirability and self-presentation motives can be powerful. This can also happen for obvious reasons when data are sensitive (e.g., teenagers under-reporting drug or alcohol use, misreporting of sexual behavior).
3. The outlier (we think!) comes from a different population. This is by far one of the most common motivations for removing outliers, and is often what graduate students and academics are concerned with. Suppose you're piloting a summer program that uses simple visual puzzles and games to increase girls' interest in computer programming. You recruit a group of 20 girls for the program, and compare the program's effectiveness to traditional methods. But suppose two girls whom you recruit happen to have mothers who are themselves computer programmers and who drove a long, long way so their daughters could be part of your summer program. Yikes! You might end up with a result that suggests your program was highly, highly effective. But perhaps it had more to do with the fact that these two girls already have strong role models supporting their interest in programming, something highly unusual compared to the rest of the population. You could argue that, because of their highly atypical backgrounds and experiences, these two girls don't resemble or behave like most girls in the population you care about.
The cons: why you might want to let outliers be
1. The outlier is a legitimate observation from your desired population. Consider the example above with the two girls in the summer programming session. As unusual as their situation may be, they are still individuals who come from the "normal" population. On what principle, then, should they be excluded?
2. I chopped an outlier and looked at my data again, and now a new outlier popped up! So when does it stop? Do you keep chopping data points until you have a data set with no "outliers"? If you have a small sample, it's unclear how long that will take, and by then you will surely have given the axe to data points that are in fact perfectly valid. Tweaking observations like this can cause more problems than it's worth.
3. A definitive rule can be hard to defend. Suppose you decide to drop outliers that are more than 3 standard deviations from the mean. Did you come up with this rule before running your analysis, or did you do a bunch of exploring that let you set cut-offs convenient for what you wanted to show? Because there is no universally accepted way of dealing with outliers, there will always be subjectivity in defining what an outlier is and isn't, and you'll need to defend your particular method and convince others that it was the best procedure available, not just the one that was best for what you, as a researcher, hoped to show.
Alternatives to outlier removal
So you've decided to leave no observation behind. Good for you! Here are some alternatives to outlier removal that researchers often use: happy middle grounds that account for skewness without forcing you to decide whether the sticky observations in question are or are not part of your intended population.
1. Non-parametric testing. When people say "parametric" testing, they just mean a tradition of hypothesis testing that generally assumes a "normal" distribution, along with all of the assumptions and requirements you've surely come to know and love. Breaking free of that assumption gives you more accommodating ways of dealing with outlier-prone data, and rank-based analyses are a good first bet. Take household income. Ever notice that when researchers or TV reporters talk about income in the United States (or wherever), they usually cite the "median" household income? They don't use the mean, because income in the US is skewed by America's millionaires and billionaires. These extreme values have a tremendous impact on the mean and, almost as if through a powerful gravitational force, drag it in their direction. To get around this, people report the median (the 50th percentile) instead. This is a great example of how working with ranks lets you get around skew. There are non-parametric (often rank-based) counterparts to t-tests, correlations, ANOVAs, and more; a rank-based comparison is sketched in the first code example after this list.
2. Transform your data. There are lots of ways of transforming data, but some common methods you might have heard about involve taking the square root or natural log of your Y variable. The idea is to transform the data to look more like a bell curve or some other known distribution, and then, when you interpret the results, transform back. The obvious downside is that all this transforming makes it easier to lose sight of what your model or estimates really mean; a quick log-transform sketch appears in the second code example after this list.
3. Use robust estimation. Robust estimation really just means using parameter estimates that hold up reasonably well when the data are skewed or contaminated. Depending on the type of test you're running, this may be a more or less viable option. However, most statistical packages offer multiple ways of generating parameter estimates, many of which were specifically designed to dodge assumptions that easily break down when data don't play nice; a simple comparison of the mean, median, and a trimmed mean appears in the third code example after this list.
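To make these alternatives concrete, here are three short sketches in Python (the data are simulated or made up, and NumPy and SciPy are assumed to be available). First, the rank-based route: on right-skewed, income-like data the long tail drags the mean above the median, and a Mann-Whitney U test compares two groups using ranks rather than raw values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical, income-like data: heavily right-skewed, with a long upper tail.
group_a = rng.lognormal(mean=10.5, sigma=0.8, size=200)
group_b = rng.lognormal(mean=10.7, sigma=0.8, size=200)

# The long right tail pulls the mean well above the median.
print(np.mean(group_a), np.median(group_a))

# A rank-based comparison of the two groups (no normality assumption).
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(result.statistic, result.pvalue)
```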
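Second, a log transformation: the raw outcome is strongly right-skewed, the logged outcome is roughly symmetric, and back-transforming the mean of the logs gives a geometric mean rather than the ordinary mean, which is the interpretive caveat mentioned above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=3.0, sigma=1.0, size=500)   # hypothetical right-skewed outcome

print(stats.skew(y))            # strongly positive: a long right tail
print(stats.skew(np.log(y)))    # close to zero after the log transform

# Back-transforming the mean of the logs gives the geometric mean,
# which is NOT the same thing as the arithmetic mean of the raw data.
print(np.exp(np.log(y).mean()), np.mean(y))
```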
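Third, robust estimation in its simplest form: compared with the ordinary mean, the median and a trimmed mean barely budge when one wild value sneaks into the sample. (Most packages also offer robust regression estimators, such as Huber-type M-estimators, in the same spirit.)

```python
import numpy as np
from scipy import stats

# Hypothetical hourly-wage sample with one yearly-income typo mixed in,
# echoing the survey example earlier in the article.
data = np.array([14.0, 15.5, 15.0, 14.5, 16.0, 15.25, 13.75, 15.75, 14.25, 30000.0])

print(np.mean(data))               # about 3013: dragged far upward by the one extreme value
print(np.median(data))             # about 15.1: barely notices it
print(stats.trim_mean(data, 0.1))  # about 15.0: drops 10% from each tail before averaging
```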
Conclusion
Remember, an outlier is a mathematical indicator that something about an observation is "highly unusual." But how people define "highly unusual" varies tremendously. For this reason, there are many different ways of identifying outliers and of answering the question of whether they should even be dealt with. In general, most academics argue that unless you have a very compelling reason to drop a particular observation from your data file, tangoing with outliers is more trouble than it's worth. You might get a slightly cleaner data set, but at the expense of losing potentially meaningful data points and opening your analysis up to additional scrutiny over your outlier criteria. Viable alternatives include transformations, rank-based (non-parametric) tests, and robust estimators, all of which handle highly skewed data more gracefully.