The Major League Baseball season has only been in action for around a month, yet some fans have already declared their team’s season “over”. While this may be true for the Miami Marlins, how can we use statistics to see how a team’s start will affect their overall season? Many pundits say a baseball team is on pace to win or lose a certain number of games by simply multiplying the team’s current winning percentage by 162 games.
For example, here’s a silly article by CBS Sports that will be totally irrelevant come October: More than half of MLB teams are on pace to win or lose 100 games in 2018. That article was the inspiration behind this post; if you are going to make outlandish claims, at least use math to support them.
I won’t spoil how many teams I think will win or lose 100 games by the end of the season just yet, but I can assure you that it won’t be 16. Luckily, most competent announcers and commentators qualify these early season predictions with a statement along the lines of “although it’s a small sample size”. While this is true, you can’t just say that and then fail to account for it in your math, especially if you are just using it as clickbait to get people to read about your thoughts on tanking.
The beautiful thing about baseball is that most things tend to even out to their natural state by the end of the season. While luck definitely factors into the results for individual seasons, a phenomenon called Regression Towards the Mean takes hold of most statistics in a baseball season. Simply put, in the long run, most things average out. This phenomenon is also why the Hot Hand is a fallacy, but I ranted about that topic in a previous post, which, wow, is a year old already. If baseball wins tend to average out over the course of a season, then what number should we regress to? And how much should we regress by? I just want to clarify that all records and standings used below are as of the end of April 24, 2018.
A Starting Point
A naive approach, but a good starting point, would be to assume that a team’s winning percentage will regress towards .500 as the season goes on. Let’s call this our prior probability. A prior probability is the winning percentage we would expect without seeing any new data. In this case, we assume that, before a single game is played, every team will have a .500 winning percentage. If you remember anything from my article on predicting the 2018 Super Bowl, determining a prior probability is the first step in performing some Bayesian analysis.
To get to a .500 winning percentage over the course of a 162 game season, a team would have to go 81-81. One way to regress towards this mean, then, would be to simply add 81-81 to a team’s current win-loss record and calculate the regressed winning percentage (wp%). For example, Boston has currently gone 17-5 for a wp% of 0.773, but their regressed wp% would be 0.532, or a record of 86-76. However, it can be shown mathematically that adding 81-81 is too much: it gives too much weight to the prior probability. The next bit is a bit mathy, so feel free to skim past it if you want. If you want to tackle it though, read this intro to RTM by FanGraphs.
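To make the two calculations above concrete, here is a minimal Python sketch of the naive "on pace" projection and the add-81-81 regression for Boston's 17-5 start. The function names are my own, chosen only for illustration.

```python
GAMES = 162  # length of an MLB regular season


def on_pace_wins(wins, losses, games=GAMES):
    """Naive 'on pace' projection: current wp% times the season length."""
    return wins / (wins + losses) * games


def regressed_wp(wins, losses, prior_wins, prior_losses):
    """Add a prior record to the current record and return the blended wp%."""
    total_wins = wins + prior_wins
    total_games = wins + losses + prior_wins + prior_losses
    return total_wins / total_games


# Boston at 17-5, as in the text
naive_wins = on_pace_wins(17, 5)
wp = regressed_wp(17, 5, 81, 81)

print(f"naive pace: {naive_wins:.0f} wins")
print(f"regressed wp%: {wp:.4f}, or about {wp * GAMES:.0f} wins")
```

The naive pace lands far above 100 wins, while adding 81-81 pulls Boston almost all the way back to average, which is the over-correction the text describes.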
*CAUTION MATH AHEAD*
If we assume that talent and luck are independent, and that a team’s actual performance is a combination of the two, we can define the following equation, where var() is the variance:
var(actual performance (wp%)) = var(talent) + var(luck)
Let’s tackle var(luck) first. If we assume that each game has a probability p of being won and a probability q = 1 - p of being lost, we can model the season as a series of coin flips, or a binomial distribution. Since our sample is a fairly large 162 games, we can use the normal approximation of the binomial distribution, which gives a simple equation for the standard deviation of luck:
sqrt(var(luck)) = sd(luck) = sqrt((p*q)/g) where g is the number of games.
Plugging in our numbers, sd(luck) = sqrt((.5*.5)/162) = .03928, or around 6.4 games a season.
To calculate the variance of actual performance, I calculated the standard deviation of the winning percentage of teams from the 2017 season:
sqrt(var(actual performance)) = sd(2017 wp%) = .07128.
We can then find var(talent) = .07128^2 - .03928^2 = .00354, and sd(talent) = .0595.
The number of games to regress by is then the g at which sd(luck) = sd(talent):
sqrt((p*q)/g) = .0595, so g = (.5*.5)/.00354, g ~= 70.6 games.
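The derivation above can be checked with a few lines of Python. This is only a sketch: the variable names are mine, and the one input taken from the text is the 2017 standard deviation of winning percentage (.07128). Because it carries full precision instead of rounded intermediate values, the last digit can differ slightly from the hand calculation.

```python
import math

GAMES = 162
p = q = 0.5            # coin-flip assumption from the text
sd_observed = 0.07128  # sd of 2017 team wp%, as quoted in the text

# var(actual) = var(talent) + var(luck), with luck modeled as binomial
var_luck = p * q / GAMES
sd_luck = math.sqrt(var_luck)
var_talent = sd_observed ** 2 - var_luck
sd_talent = math.sqrt(var_talent)

# Games at which luck's spread shrinks to equal talent's spread
regress_games = p * q / var_talent

print(f"sd(luck)   = {sd_luck:.5f} ({sd_luck * GAMES:.1f} games)")
print(f"sd(talent) = {sd_talent:.4f}")
print(f"games to regress by = {regress_games:.1f}")
```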
This means we can add 70.6*p to our wins and 70.6*q to our losses to calculate our new regressed record and winning percentage. This also works for any record at any time. Suppose your team starts 2-6. Adjust it (add 35.3 wins and 35.3 losses) to get roughly 37-41. That is an estimated talent of .474 wp%, or a 77-85 record.
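As a rough sketch of that rule, the helper below adds 70.6*p pseudo-wins and 70.6*q pseudo-losses to any record. The function name and the default arguments are my own; the 70.6 comes from the calculation above, and the two records are ones used elsewhere in this post.

```python
REGRESS_GAMES = 70.6  # games to regress by, from the text
GAMES = 162


def regress_record(wins, losses, prior_wp=0.5, regress_games=REGRESS_GAMES):
    """Estimate talent wp% by adding prior pseudo-wins and pseudo-losses."""
    pseudo_wins = regress_games * prior_wp
    pseudo_losses = regress_games * (1 - prior_wp)
    return (wins + pseudo_wins) / (wins + losses + pseudo_wins + pseudo_losses)


# Two records mentioned in this post: Boston (17-5) and Texas (8-17)
for wins, losses in [(17, 5), (8, 17)]:
    wp = regress_record(wins, losses)
    print(f"{wins}-{losses}: regressed wp% {wp:.3f}, about {wp * GAMES:.0f} wins")
```

Passing a different prior_wp lets the same helper regress towards something other than .500.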
*END OF MATHY STUFF*
This should end the mathy stuff for now. Again, I did not develop the above method to calculate the number of games needed to regress; that credit goes to the incomparable Tom Tango. I merely used an updated data set to measure actual performance variance.
For further reading, I recommend Tom’s blog, Phil’s blog, 3-D baseball, and FiveThirtyEight. I think the Hardball Times also has a couple of articles on it.
This regressed record can be used as our Bayesian posterior probability, or updated belief based on the data we have seen. We can use these probabilities to create a distribution and answer questions such as: my team started 7-11, what is the probability we lose 100 games? Or: my team started 17-5, what is the probability we win 100 games?
To begin with, here is our prior distribution, modeled by the Beta distribution. I then sampled from it 10,000 times to create a histogram. As you can probably tell, the mean and median winning percentage are both .500, as expected.
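For anyone who wants to recreate that histogram, here is a minimal sketch using only Python's standard library. It assumes the prior is Beta(35.3, 35.3), which is what the posterior used for the Rangers later in this post implies; the seed, the bin width, and the text-based histogram are my own choices.

```python
import random
import statistics
from collections import Counter

random.seed(2018)  # arbitrary seed so the run is repeatable

# Assumed prior: 35.3 pseudo-wins and 35.3 pseudo-losses
PRIOR_A = PRIOR_B = 35.3

samples = [random.betavariate(PRIOR_A, PRIOR_B) for _ in range(10_000)]
mean_wp = statistics.mean(samples)
median_wp = statistics.median(samples)
print(f"mean {mean_wp:.3f}, median {median_wp:.3f}")

# Crude text histogram with bins that are 0.05 wide
bins = Counter(int(s * 20) for s in samples)
for b in sorted(bins):
    print(f"{b / 20:.2f}-{(b + 1) / 20:.2f} {'#' * (bins[b] // 100)}")
```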
The Texas Rangers are currently 8-17. If we assume that they should regress towards our prior distribution of .500 as the season goes on, we can calculate the posterior distribution as simply
Beta(8 + 35.3, 17 + 35.3). Here is that posterior distribution graphed:
Their new estimated winning percentage is 0.452, good for a 73-89 record. While the Rangers’ current record affects our beliefs, our prior still dominates as we know a .320 win percentage is unlikely to hold over an entire season. As we see more data, our posterior distribution will narrow onto the true record as our prior belief is diminished in relevance.
With this distribution, we can actually calculate probabilities for various questions. For instance, what is the probability that the Texas Rangers will turn it around and win 100 games? Well, effectively 0%. That would require a winning percentage of .6173 for the season, which lies on the far right tail of our distribution. What about finishing at .500 or greater? That has a slightly higher probability of 18%. Finally, the probability of the Rangers losing 100 games or more is a mere 8.3%. It really is hard to win or lose 100 games!
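One way to approximate these probabilities without a stats package is to sample from the posterior and count how often each threshold is crossed. The sketch below assumes the probabilities are tail areas of Beta(8 + 35.3, 17 + 35.3), which is my reading of the method. The seed and sample size are arbitrary, so the estimates will wobble a little from run to run and need not match the quoted figures to the decimal.

```python
import random

random.seed(2018)  # arbitrary seed so the run is repeatable
GAMES = 162

# Rangers posterior: 8-17 record plus 35.3 pseudo-wins and pseudo-losses
a, b = 8 + 35.3, 17 + 35.3
post_mean = a / (a + b)

draws = [random.betavariate(a, b) for _ in range(100_000)]


def share(condition):
    """Fraction of posterior draws for which condition(wp) is true."""
    return sum(1 for wp in draws if condition(wp)) / len(draws)


p_win_100 = share(lambda wp: wp >= 100 / GAMES)  # needs wp% of about .617
p_500 = share(lambda wp: wp >= 0.5)
p_lose_100 = share(lambda wp: wp <= 62 / GAMES)  # 62 wins or fewer

print(f"posterior mean wp%: {post_mean:.3f}")
print(f"P(100+ wins)   ~ {p_win_100:.4f}")
print(f"P(.500 or up)  ~ {p_500:.3f}")
print(f"P(100+ losses) ~ {p_lose_100:.3f}")
```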
The more baseball inclined among you may have realized that expecting every team to regress towards a .500 record is nonsensical. Nobody really expects the Astros or Dodgers to continue to hover around .500 for long, and the Red Sox definitely seem like a much better than average team. Similarly, the Marlins and White Sox won’t come anywhere near .500 this season. Instead of using p = q = .5, we can use the 2018 projected wins and winning percentage from OddsShark.
For example, the Texas Rangers were projected to win 77.5 games, for a projected wp% of 0.478. This is our new p, and q = 0.522. Using this, we find a new number of games to add to their record to create our posterior distribution. In this case, the posterior is now modeled by Beta(8 + 33.7, 17 + 36.75). Here is that distribution:
The expected winning percentage is 0.436, for a record of 71-91. It makes sense that this is worse than before, because OddsShark expects the Rangers to regress to a worse record than 81-81. The probability of them losing 100 games or more increases to 14.35%, and the chance of the Rangers crawling back up to .500 or better falls to 10.7%. Their 95% CI for winning percentage is a quite broad [.354, .519]. As more data is collected over the season, this should narrow.
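Here is a hedged sketch of how a team-specific prior like this could be built. It assumes the talent variance from the math section stays fixed and only p and q change, so that the games to regress by is p*q/var(talent); the post does not spell out that step, so treat it as my interpretation. The function name, seed, and sampled interval are likewise my own, and the interval printed here may not match the one quoted above, depending on how that one was computed.

```python
import random

random.seed(2018)  # arbitrary seed so the run is repeatable
GAMES = 162

# Talent variance from the math section: var(actual) - var(luck) at p = .5
VAR_TALENT = 0.07128 ** 2 - 0.25 / GAMES


def team_prior(projected_wins):
    """Return (pseudo_wins, pseudo_losses) for a preseason win projection."""
    p = projected_wins / GAMES
    q = 1 - p
    regress_games = p * q / VAR_TALENT  # assumption: var(talent) held fixed
    return regress_games * p, regress_games * q


prior_w, prior_l = team_prior(77.5)  # Rangers projection quoted in the text
a, b = 8 + prior_w, 17 + prior_l     # posterior after an 8-17 start
post_mean = a / (a + b)

# Central 95% interval estimated from sorted posterior draws
draws = sorted(random.betavariate(a, b) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]

print(f"prior: {prior_w:.1f} wins, {prior_l:.1f} losses")
print(f"posterior mean wp%: {post_mean:.3f} (about {post_mean * GAMES:.0f} wins)")
print(f"central 95% interval: [{lo:.3f}, {hi:.3f}]")
```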
Finally, I calculated a personalized prior for every team and used it to project their record at the end of the season using the methods described above. The results are also available on my GitHub. Here is that table, which should open bigger in a new tab:
No big surprises here. Miami has the highest probability of losing 100 games, at a ridiculous 65.7%, although the White Sox, Reds, and Royals all have a greater than 40% chance as well. The only two teams with a chance at 100 wins are Boston and Houston. When I compared this to the FanGraphs projections for the rest of the season, the only major discrepancy was the Diamondbacks. I project them to go 93-69 right now, while FanGraphs has a more conservative 86-76 record. Also, the NL East could go either way between the Mets and Nationals, due to the awfully slow start from the Nats. Again, the confidence intervals are WIDE open right now.
If I were a betting man though, I would say that only Miami will lose 100 games, and no other team will lose or win that many. The Marlins are the only team "on pace" to win or lose 100 games, where "on pace" is defined as a probability greater than 0.5. Take that, CBS.
Fine, here are the playoff rankings from the above estimates. I'm sure no caveat about small sample sizes or large confidence intervals will placate some of you though. Sorry in advance, Dodgers, Nats, or Cardinals fans.
AL Division Winners:
Boston, Houston, Cleveland
NL Division Winners:
Arizona, Chicago, Mets (!), with the Nationals three games out.
NL Wild Card:
Dodgers, Milwaukee, with St. Louis one game out.
What do you think about these predictions or the small sample size mumbo-jumbo? How serious do you think the tanking problem is? Let me know in the comments below! As always, all data and code can be found on my GitHub.
The SaberSmart Team