As the dog days of summer continue to roll on, all thirty baseball teams have finally reached the informal halfway point of the season, and get a well-deserved break as the country focuses on the All-Star Game and Home Run Derby. As such, I decided to see how my end of season forecasts have changed based on all of this new data that is available.
Before we start, in case you missed it or need a refresher on the topic and my results, check out my first installment where I predicted end of season records based on a “small” sample size of the first three weeks of the season. I am using the same Bayesian methodology, except with this newly gathered data, our posterior will rely less heavily on our prior expectations.
To briefly summarize my last article on this topic, the beautiful thing about baseball is that most things tend to even out to their natural state by the end of the season. While luck definitely factors into the results for individual seasons, a phenomenon called regression toward the mean takes hold of most statistics over a baseball season. Simply put, in the long run, most things average out. If baseball wins tend to average out over the course of a season, then what number should we regress to? In other words, how much should we regress by?
The answer to this is quite complicated, and I delved into the math in my last piece on the subject. Essentially, the number is ~71 games when using data from the 2017 season. You multiply that number by your prior (assumed) winning percentage and add the result to a team’s current record. For example, if you expect the Giants to regress to a .500 record by the end of the season, we would multiply 71 by .500 for a record of roughly 35-35. Adding that to their current record of 50-48 gives 85-83, a winning percentage of .506, and multiplying by 162 gives an end-of-season regressed record of 82-80. Thus, we would expect the Giants to go 32-32 in the second half. For further reading, I recommend Tom’s blog, Phil’s blog, 3-D baseball, and FiveThirtyEight. I think the Hardball Times also has a couple of articles on it too.
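The Giants arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation as described in this paragraph, not the code from our repository. The function name and its parameters are made up for this example, and it keeps the fractional 35.5 prior wins instead of rounding to 35-35, which lands on the same final record.

```python
def regressed_wins(wins, losses, prior_pct, prior_games=71, season_games=162):
    """Blend a current record with a prior and scale it to a full season.

    prior_games=71 is the regression constant quoted in the article;
    the function itself is a hypothetical helper written for this example.
    """
    prior_wins = prior_games * prior_pct  # pseudo-wins contributed by the prior
    regressed_pct = (wins + prior_wins) / (wins + losses + prior_games)
    return regressed_pct, round(regressed_pct * season_games)


# Giants: 50-48 at the break, with a naive .500 prior.
pct, total_wins = regressed_wins(50, 48, 0.500)
print(f"{pct:.3f} -> {total_wins}-{162 - total_wins}")  # 0.506 -> 82-80
print(f"second half: {total_wins - 50}-{(162 - total_wins) - 48}")  # second half: 32-32
```

Swapping in a team-specific prior only means changing `prior_pct`.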
While .500 is a decent naive prior winning percentage for the Giants, for the Red Sox or Royals this is vastly misleading. Instead, I again used the 2018 projected wins and winning percentage from OddsShark to define a unique prior for each team. After I calculated the newly regressed records, I compared them to my estimates from April 18. This allows me to see which teams have recovered from slow starts, or on the other hand, completely collapsed *cough METS cough*. Here are the results, sorted by the difference in regressed wins:
Here is a scatterplot of the above table, with the April 18 regressed wins on the x-axis and the July 16 regressed wins on the y-axis. The closer to the dashed line a point is, the smaller the change in expected regressed wins. The points below the line have “collapsed” while those above have improved from their April regressed wins.
To show just how powerful this method is, the average of the difference in wins for all teams is zero. Furthermore, 26 of the 30 teams had their expected number of regressed wins change by 7 or fewer in either direction. In fact, for half of the teams, the end-of-season expectation changed by 3 games or fewer! To put it another way, after gathering an additional 3 months of data, our end-of-season expectations hardly changed from our estimates after 3 weeks of data for half of the teams.
The below slideshow depicts the beta approximations from April and July for the end of season winning percentage for a selection of teams where I found the difference most stark. As expected, the beta approximations based off of the July 16 data are narrower, leading to a more condensed confidence interval. As data continues to be gathered, these will continue to narrow. The Texas Rangers’ expected wins hardly changed, while on the other hand, the Reds have recovered from their slow start in April and the Mets completely collapsed from their April record. Interestingly, the Oakland A’s and Seattle Mariners have also improved their expected wins by a significant amount. This is because our model is now regressing based more off of the data that has been gathered, rather than the prior, or OddsShark's pre-season expectation.
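To make the narrowing concrete, here is a rough sketch of one way to build the beta approximation: treat the 71-game prior as a pseudo-record and add it to the real record to get the two beta parameters. That parameterization is my assumption for this example, the 9-9 April record is hypothetical, and the helper names are invented here.

```python
import math

PRIOR_GAMES = 71  # regression constant quoted earlier in the article


def beta_posterior(wins, losses, prior_pct, prior_games=PRIOR_GAMES):
    """Return (alpha, beta) after adding the prior's pseudo-record to the real one."""
    alpha = wins + prior_games * prior_pct
    beta = losses + prior_games * (1 - prior_pct)
    return alpha, beta


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    sd = math.sqrt(alpha * beta / (total ** 2 * (total + 1)))
    return mean, sd


def data_weight(games_played, prior_games=PRIOR_GAMES):
    """Share of the posterior mean that comes from games actually played."""
    return games_played / (games_played + prior_games)


# Hypothetical .500-prior team: 9-9 after three weeks, 50-48 at the break.
april_mean, april_sd = beta_mean_sd(*beta_posterior(9, 9, 0.500))
july_mean, july_sd = beta_mean_sd(*beta_posterior(50, 48, 0.500))

print(f"April: sd={april_sd:.3f}, data weight={data_weight(18):.0%}")
print(f"July:  sd={july_sd:.3f}, data weight={data_weight(98):.0%}")
```

Under these assumptions the data's share of the estimate climbs from about 20% after 18 games to about 58% after 98 games, and the standard deviation shrinks accordingly.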
Unfortunately, teams like the Blue Jays, Angels, Baltimore, and the Royals have decreased their expected wins the most. Again, this is due to the beta approximation weighting the data collected more than the prior. Even in April, when Baltimore and the Royals had two of the worst records in baseball, the prior from OddsShark hinted that there was no way they could continue to be this terrible. However, it turns out they are this atrocious, as neither has hit 30 wins yet this season.
I find that the comparison between the Reds and the Mets offers the best juxtaposition. Both now have about the same beta approximation for their end-of-season records; however, the Reds increased theirs from April while the Mets completely collapsed.
Naturally, I then compared their regressed wins to the pre-season expectations by OddsShark, the numbers I used as my prior. The table below is sorted by the difference in the regressed wins and pre-season OddsShark expected wins:
This chart illustrates what I was trying to say above about the beta approximation's over-reliance on the prior when little evidence has been gathered. OddsShark had the Royals and Orioles doing much better than what we have seen so far this season. However, since our model now weights the observed data more heavily than the prior, the discrepancy shows up as a large difference between the regressed wins and the OddsShark predictions. The same is true for teams like Oakland, Seattle, Atlanta, and the Phillies. Boston is just ridiculous. OddsShark expected them to win a decent 92 games, but I expect them to win around 104(!) games by the end of the season.
The scatterplot below is formatted similar to the one above. The closer to the line a point is, the closer the regressed wins are to the pre-season OddsShark predictions. Above the line means a team is expected to outperform their OddsShark prediction based on their regressed wins, and vice versa for below the line.
We can see from this graph a similar story to the one above. Based on the first three months of play and the resulting regressed end-of-season records, I expect that OddsShark undervalued the Red Sox, Oakland, Phillies, Mariners and Braves, while overvaluing the Orioles, Royals, White Sox, Mets and Nationals.
Finally, I examined the results of the July regressed wins with their confidence intervals and probability of winning/losing 100 games.
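As a rough sketch of how a 100-win probability can fall out of a beta approximation, the snippet below draws end-of-season winning percentages from a beta distribution and counts how often they clear 100/162. Several things here are assumptions for illustration only: the pseudo-record parameterization, the simulation approach, the function name, and the Red Sox inputs (an assumed 68-30 record at the break, which is not quoted in this article, plus the 92-win OddsShark prior).

```python
import random


def prob_at_least(target_wins, wins, losses, prior_pct,
                  prior_games=71, season_games=162, draws=50_000, seed=0):
    """Estimate P(end-of-season wins >= target_wins) by sampling a beta distribution.

    Hypothetical helper: alpha and beta come from the real record plus a
    prior pseudo-record worth prior_games games.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    alpha = wins + prior_games * prior_pct
    beta = losses + prior_games * (1 - prior_pct)
    threshold = target_wins / season_games
    hits = sum(rng.betavariate(alpha, beta) >= threshold for _ in range(draws))
    return hits / draws


# Assumed inputs: 68-30 at the break, prior of 92 wins over 162 games.
p_100 = prob_at_least(100, wins=68, losses=30, prior_pct=92 / 162)
print(f"P(100+ wins) ~ {p_100:.2f}")
```

The chance of losing 100 games works the same way, with the comparison flipped to count draws at or below 62/162.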
To my surprise, 6 teams still have at least a 50% chance of winning or losing 100 games! The Red Sox lead the pack with an insane 73.5% chance of winning at least 100 games according to my model, followed by the Astros at 59.6% and Yankees at 57.4%. The Royals lead the other end of the spectrum, with a ridiculous 80.4% chance of losing 100 games, followed by the O’s at 75.8% and the White Sox at 55.6%. Again, since these are probabilities, the chance that all 6 teams win/lose 100 games is a paltry 8.5%, since the individual probabilities have to be multiplied (treating each team's outcome as independent). However, it is looking likely that we have the first season with two 100-loss teams since 2013.
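The 8.5% figure is just the product of the six individual probabilities quoted above. Here is a quick check, assuming the six outcomes are independent, which is a simplification since these teams play each other. The variable names are mine.

```python
import math

# Probabilities quoted above: 100+ wins for the first three, 100+ losses for the rest.
p_milestone = {
    "Red Sox": 0.735,
    "Astros": 0.596,
    "Yankees": 0.574,
    "Royals": 0.804,
    "Orioles": 0.758,
    "White Sox": 0.556,
}

# Under independence, the joint probability is the product of the individual ones.
joint = math.prod(p_milestone.values())
print(f"{joint:.1%}")  # 8.5%
```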
And of course, here are the playoff rankings from the above estimates. Will Seattle finally break their postseason drought?!
AL Division Winners:
Boston, Houston, Cleveland
AL Wild Card:
Yankees, Seattle (!)
NL Division Winners:
Dodgers, Chicago, Nationals
NL Wild Card:
The American League already seems to be all but wrapped up. Based on these projections, the closest team to the second wildcard would be Oakland, but by around 5 games. While only one of the Red Sox or Yankees will win the division, the other is essentially guaranteed to host the wildcard game.
The National League is a different story. The NL East is a mess, and under-performance by the Dodgers, combined with over-performance by Arizona and the Brewers, is making for tight races in the Central and West.
Here are the regressed wins for the NL East division leaders:
It is well within the margin of error (the 90% CI) for those to change and reorder themselves in any number of ways. Or the Nats will finally get a grip and win the division by 6 games because baseball.
The Rockies, Cardinals, and the 2nd and 3rd place winners of the NL East all still have a shot of shaking up the wildcard picture as well.
In conclusion, this could be a historic season for the AL. While there have been 23 seasons with two or more 100-win teams, only six seasons have seen three teams finish above the 100-win threshold in the same season, and never more than two in a single league. The American League could change that this year, albeit with a probability right now of only 25%.
While this does make the AL less competitive, the tight NL races more than make up for it.
What do you think? Is your team still in the running for a pennant? How will the NL shake out? Let me know in the comments below! As always, our code, data, and predictions can be found on Github.
The SaberSmart Team