Among the many awards handed out at the end of the baseball season are two batting titles, one per league. Each title goes to the hitter with the highest batting average in his league among those with at least 3.1 plate appearances per team game, which works out to 502 plate appearances over a full 162-game season. Last year, Jose Altuve and DJ LeMahieu were the batting champions in the American League and National League respectively, but that does not necessarily make them baseball’s most productive hitters.
Batting average (BA) measures how often a hitter gets a base hit per at bat. Quantitatively, it is the number of hits divided by the number of at bats:
BA = H/AB
A player’s offensive output used to be measured entirely by batting average. But batting average leaves out walks and counts a home run the same as a bloop single, when we all know that not all hits are created equal. Two newer statistics were developed to fix those shortcomings.
On-Base Percentage (OBP) is essentially batting average with walks and hit-by-pitches counted alongside hits. It measures how often a batter reaches base for any reason other than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference. While on-base percentage gives credit for these additional ways of reaching base, it still weights them all equally, so a walk counts the same as a home run. Its formula is as follows:
OBP = (H+BB+HBP)/(AB+BB+HBP+SF)
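To make the difference between the two formulas concrete, here is a minimal Python sketch that computes BA and OBP for one stat line. The function names and the season totals are our own, invented purely for illustration; they do not belong to any real player.

```python
def batting_average(h, ab):
    """BA = H / AB."""
    return h / ab


def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


# Hypothetical season line, invented for illustration only.
season = {"AB": 500, "H": 150, "BB": 60, "HBP": 5, "SF": 5}

ba = batting_average(season["H"], season["AB"])
obp = on_base_percentage(
    season["H"], season["BB"], season["HBP"], season["AB"], season["SF"]
)

print(f"BA  = {ba:.3f}")   # 150 / 500
print(f"OBP = {obp:.3f}")  # 215 / 570
```

For this made-up hitter, the 65 walks and hit-by-pitches lift a .300 batting average to an on-base percentage of about .377, which is exactly the contribution batting average ignores.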
The Slugging Percentage (SLG) is a measure of the power of a hitter. It was created as a way to weight hits by the number of bases a player took. For instance, a double is worth twice as much as a single, and a home run is four times as valuable. While slugging percentage weights hits, the weights are arbitrary and not entirely accurate. Additionally, it disproportionately favors sluggers and ignores other ways of reaching base. SLG can be calculated using one of the following formulas:
SLG = ((1*1B)+(2*2B)+(3*3B)+(4*HR))/AB or SLG = (H+2B+(2*3B)+(3*HR))/AB
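The two formulas agree because H already counts every hit once, so the second form only needs to add the extra bases. A short Python sketch can check this; the function names and hit totals are hypothetical and chosen only for the example.

```python
def slg_from_total_bases(singles, doubles, triples, hr, ab):
    """SLG = (1*1B + 2*2B + 3*3B + 4*HR) / AB."""
    return (1 * singles + 2 * doubles + 3 * triples + 4 * hr) / ab


def slg_from_hits(h, doubles, triples, hr, ab):
    """SLG = (H + 2B + 2*3B + 3*HR) / AB, where H = 1B + 2B + 3B + HR."""
    return (h + doubles + 2 * triples + 3 * hr) / ab


# Hypothetical hit totals, invented for illustration only.
singles, doubles, triples, hr, ab = 100, 30, 5, 15, 500
h = singles + doubles + triples + hr  # 150 hits in total

slg_a = slg_from_total_bases(singles, doubles, triples, hr, ab)
slg_b = slg_from_hits(h, doubles, triples, hr, ab)

print(f"SLG (total bases form) = {slg_a:.3f}")
print(f"SLG (hits form)        = {slg_b:.3f}")
```

Both forms count 235 total bases in 500 at bats, so both give a slugging percentage of .470.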
Although OBP and SLG each address weaknesses of batting average, both still have glaring flaws on their own. To fix this, baseball analysts added the two together, creating On-base Plus Slugging (OPS). OPS folds the different aspects of hitting into one metric by accounting for the ways a player can reach base and by weighting the different types of hits. One caveat: OPS is unweighted, meaning it treats a point of SLG as worth the same as a point of OBP. In reality, a handy estimate is that OBP is around twice as valuable as SLG.
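The unweighted caveat is easy to see in a small Python sketch. The two hitters below are hypothetical, and their OBP and SLG values are invented for illustration.

```python
def ops(obp, slg):
    """OPS is the plain, unweighted sum of OBP and SLG."""
    return obp + slg


# Two hypothetical hitters as (OBP, SLG), invented for illustration only.
hitters = {
    "patient_hitter": (0.400, 0.400),
    "power_hitter": (0.300, 0.500),
}

for name, (obp, slg) in hitters.items():
    print(f"{name}: OBP={obp:.3f} SLG={slg:.3f} OPS={ops(obp, slg):.3f}")
```

Both hitters come out at an .800 OPS even though one reaches base far more often. If OBP really is worth more than SLG, the patient hitter is the more valuable of the two, and OPS alone cannot show that.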
Qualitatively, OPS looks like the best offensive metric when compared to BA, OBP, and SLG individually. We can also back this up quantitatively by using statistics.
For a baseball team to win a game, it needs to score more runs than it allows. Historically, most teams focused on batting average as the statistic to improve in order to score more runs. We can determine which statistic a team should use by looking at each statistic's correlation with runs scored and then fitting a simple linear regression.
The following analysis uses 20 years of baseball statistics from Lahman’s Baseball Database, covering the seasons after the players’ strike, from 1995 to 2014.
The Pearson correlation coefficient is +1 in the case of a perfect direct linear relationship, -1 in the case of a perfect inverse linear relationship, and some value in between for all other cases. As the correlation value approaches zero, there is less of a relationship. Conversely, the closer the coefficient is to either -1 or 1, the stronger the correlation between the variables.
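For readers who want to see what goes into that coefficient, here is a minimal sketch of the Pearson correlation written from scratch in Python. The function name and the tiny input lists are ours and purely illustrative; this is not necessarily how our analysis code computes it.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy inputs, invented for illustration only.
xs = [1, 2, 3, 4, 5]
direct = [2, 4, 6, 8, 10]    # rises in lockstep with xs
inverse = [10, 8, 6, 4, 2]   # falls in lockstep with xs

print(f"direct:  r = {pearson_r(xs, direct):+.2f}")
print(f"inverse: r = {pearson_r(xs, inverse):+.2f}")
```

The perfectly direct relationship gives +1 and the perfectly inverse one gives -1, matching the two extremes described above. Real team data falls somewhere in between.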
To compare our linear models, we use the coefficient of determination, R^2, which tells us how much of the variance in runs scored we can explain with the input variable (BA, OBP, SLG, or OPS). The higher this coefficient is, the better our model.
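As a rough sketch of how R^2 falls out of a simple linear regression, the Python below fits a least-squares line and compares the leftover error to the total variation in runs. The helper names and the team values are invented for illustration; they are not Lahman data and not our actual results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented team-season values, for illustration only (not Lahman data).
team_ops = [0.700, 0.720, 0.740, 0.760, 0.780, 0.800]
team_runs = [650, 690, 720, 770, 790, 850]

print(f"R^2 = {r_squared(team_ops, team_runs):.3f}")
```

With a single predictor, R^2 is simply the square of the Pearson correlation, so the two measures always rank the four statistics in the same order.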
The results are as follows:
All of the individual statistics have a high positive relationship (correlation) with runs scored. However, OPS has the strongest relationship. Additionally, 89% of the variance in runs scored can be explained when using OPS as a predictor, which is the best percentage amongst the four statistics!
Below are two graphs: the first plots a team’s runs scored against its batting average, and the second plots runs scored against its on-base plus slugging. As you can see, the points sit tighter around the regression line for OPS, which reflects the higher R^2 value above.
Clearly, OPS is the better statistic for predicting runs. In the coming weeks we will be exploring how and why Billy Beane used OPS to identify and recruit undervalued baseball players for the Oakland Athletics in the early 2000s. With this strategy, his team could achieve as many wins as teams with more than double the payroll!
What do you think? Should batting average be completely removed from the game? Or is it too deeply ingrained in baseball’s esteemed tradition? Let us know in the comments!
As always, our code can be found on GitHub.
The SaberSmart Team