A common dream of most baseball aficionados is to visit every ballpark, stepping inside with their own two feet, feeling the history, and seeing every team live. I’m no different; and can already cross off almost two-thirds of the current ballparks off my list (I’m coming for you soon East Coast). After seeing this article on an itinerary that would allow travelers to see every single national park in the 48 contiguous states on a road trip without wasting any time, I wondered how this methodology worked, and how it could be applied to a ballpark journey. As a side note, Randy Olson has also organized the ultimate US road trip and the best cross-Canada journey. I highly recommend him as a follow on Twitter as well!
It turns out that this problem isn’t solved with machine learning, but instead the solution stems from a field of operations research called decision analytics. Finding the shortest route between multiple points is a generalization of what is known as the Travelling Salesman Problem, one of the most intensively studied problems in computational mathematics and operations research. Basically, the travelling salesman problem (TSP) asks the following question:
"Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?"
It is an NP-hard problem in combinatoric optimization. If P ≠ NP, then NP-hard problems cannot be solved in polynomial time, meaning for a large enough number of points, an exact solution could never be found. I bring this up though because it turns out that the aforementioned “If” statement is one of the biggest IFs in mathematics! If YOU can prove that P ≠ NP, then you can win $1,000,000 courtesy of the Clay Mathematics Institute. Only one of the seven so-called Millennium Prize Problems have ever been solved, so if this one seems too hard, you can always try for one of the other five.
Now back to the Travelling Salesman. As you can probably imagine, the TSP is a tough nut to crack since it takes forever to check every possible route to find the shortest one. The problem was first formulated in 1930 and is one of the most intensively studied problems in optimization. Consequently, the best-known methods can be combined to make short work of finding the shortest possible route.
Consider a salesman traveling in the US. The salesman starts in New York and has to visit a set of cities on a business trip before returning home. The problem then consists of finding the shortest tour which visits every city on the itinerary.
We want to minimize the total distance travelled during the tour. Therefore, we simply calculate the distance between each pair of cities. The total distance travelled is thus the sum of the distances of the edges included in the journey.
The problem can be stratified into symmetric and asymmetric subclasses depending on the distance between cities. For example, if two cities, A and B, are connected and the distance traveled between them is equal despite the starting and end points, the problem is said to possess symmetry. However, if the direction of travel between cities A and B yields different distances, the problem is considered asymmetric. The symmetric problem is an undirected graph where the vertices (cities) are connected by a set of edges; the vertices of asymmetric TSP are connected by directed arcs.
Next, we add a constraint that the salesman should enter and leave each city exactly once as well as return to the original starting point.
However, with just this constraint on the number of edges in the journey entering and exiting each city, we may produce solutions that are not connected. In this example below, each city has exactly two paths in the journey entering and exiting it. But there is no path for the salesman to travel between each subjourney. We need to add extra constraints to our model to eliminate these solutions.
A problem with an objective function, or goal, held to enumerated constraints can be modeled as an integer programming model. To summarize this operations research approach then, we build a model for our problem based on the decisions we want to make, the overall objective we have in mind, and restrictions and constraints on what defines a solution. This model is then fed into software that uses complex algorithms to find provably optimal solutions.
So our objective is to visit each ballpark in the country with the shortest total distance traveled, and our restrictions are that we should visit each city only once, returning to our original city, without any sub-journeys.
With acquired data from the Seamheads’ Ballpark Database, powered by the Baseball Gauge, and software installed from Gurobi, implementation in Python was a piece of cake.
Well, there was some complex implementation in how I calculated the distance between cities. The most familiar methods in calculating Euclidean distance are either “straight-line” distance, or the more practical measurement for intra-city travel, Manhattan Distance. “Straight-Line” distance ignores geographic obstacles, while Manhattan’s Distance is constrained to obstacles like roads and buildings.
Since the earth is not flat and is rather an oblate ellipsoid, using any of the two-dimensional distance calculations would suggest a path through the Earth. This is obviously not logical.
The two main methods to calculate distance between two points on the Earth’s surface are by using the geodesic distance or the great-circle distance. The Great Circle distance assumes that the Earth is spherical, which results in a .05% error in the distance calculated due to the Earth being slightly elliptical due to its rotation. I chose to use the geodesic distance, which is more accurate, yet is a “as-the-crow-flies” distance.
The first iteration of the model considered only the 30 active ballparks. For perspective, here is where they are all located in the United States:
The optimal path discovered by the algorithm is shown below and can start from any position in the order:
The optimal geodesic distance traveled was 8,962.77 miles and it seems to follow a path that follows nation’s border. The algorithm did a respectable job configuring the optimal path to include ballparks in states inside this path like Missouri, Minnesota, and Colorado. Similarly, the algorithm accurately mapped clusters in regions like the Northeast where ballparks in New York, Boston, Philadelphia, and D.C., were in close proximity.
Next, we ran the algorithm against all 253 historical ballpark locations that have ever hosted an MLB game from around the globe. This included the series played in Mexico and Puerto Rico this year, as well as previous games played in Tokyo, Sidney, and even Honolulu!
Here are the locations on the world map of all 253 stadiums used in this part of the analysis:
The algorithm discovered a mostly optimized route. The optimal geodesic distance traveled would be 29,298.9 miles. This history of the sport is apparent here, as 32% of the locations were found in the Northeast. This concentration of nodes likely demanded the most work for the program. The algorithm seems to have found the optimal solution regarding international travel, the longest legs of the path. These routes include flying from Hawaii, to Australia, to Japan, while returning to Seattle. Further, the optimal route accommodates Puerto Rico among a cluster of Florida ballparks and includes Mexico between ballparks in Texas and Arizona.
In comparison to the 30-node solution path, locations in Florida were adjacent to Texas; however, in the 253-node solution, Florida locations are adjacent to Georgia and Kentucky. This very well could be the optimal solution, but since the algorithm failed to converge completely, we cannot say for certain a more optimal path does not exist.
The algorithm dedicated a lot of effort testing each ballpark in clusters with every other node in the problem, but in terms of the optimal path these ballparks within the clusters should always connect to each other in some order. The intra-city ballpark paths then could be where the error in the optimality gap lies.
As you can see by this table, the length of time that the program ran for the historical ballparks is proportionate to its optimality gap. Gurobi defines convergence as an optimality gap at essentially zero.
Even if this isn’t the optimal solution, it looks pretty darn close. As XKCD poignantly observed, the quickest way to visit every ballpark is just to visit each virtually from the comfort of your couch, perhaps by watching a game.
I hope this was an interesting introduction to the Travelling Salesman Problem and time complexity! Do you agree with the order of both roadtrips the algorithm proposed? How many ballparks have you visited? Let us know in the comments below! As always, our code can be found on Github.
The SaberSmart Team
P.S. If you enjoyed this article, and need something off Amazon anyway, why not support this site by clicking through the banner at the bottom of the page? As a member of the Amazon Affiliates program, we may receive a commission on any purchases. All revenue goes towards continued hosting of this site.