In today’s technologically dependent and interconnected world, data comes in many forms. There is structured data, such as the numeric fields of a database. There is also semi-structured and unstructured data, such as text files and web interactions. In fact, unstructured data is the most common form, often lurking in places where data is not usually thought to exist. Analytical work today requires extensive time spent putting data into a structured form and preparing it for analysis. Consequently, understanding the different data types is vital for analytical success.
Among the plethora of awards given out at the end of the baseball season are two batting titles, one per league. Each title goes to the hitter with the highest batting average in his league, provided he has at least 3.1 plate appearances per team game, or at least 502 plate appearances for the season. Last year, Jose Altuve and DJ LeMahieu were the batting champions in the American League and National League, respectively, but that does not necessarily make them baseball’s most productive hitters.
Batting average (BA) measures the fraction of at bats in which a hitter gets a base hit. Quantitatively, it is the number of hits (H) divided by the number of at bats (AB): BA = H / AB.
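A minimal Python sketch of the batting average definition and the batting title threshold described above. The function names, the sample stat line, and the 162-game default (the standard length of an MLB regular season) are assumptions added for illustration and do not come from the original text.

```python
def batting_average(hits, at_bats):
    """Return hits divided by at bats."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return hits / at_bats


def qualifies_for_title(plate_appearances, team_games=162):
    """Check the 3.1 plate appearances per team game requirement.

    team_games=162 is an assumption (a full regular season);
    3.1 * 162 = 502.2, which rounds to the 502 quoted in the text.
    """
    required = round(3.1 * team_games)
    return plate_appearances >= required


# Hypothetical stat line, not a real player's season.
avg = batting_average(hits=200, at_bats=600)
print(f"{avg:.3f}")              # 0.333
print(qualifies_for_title(650))  # True
```

Note that the eligibility rule counts plate appearances while the average itself is computed over at bats, so the two functions take different inputs.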
In many articles published these days, especially those about baseball anomalies, we read about "outliers," "extreme observations," and "influential observations." For instance, SBNation claimed that "twenty runs scored in a game is, of course, an outlier" in reference to the Twins' game against the Mariners last night. While these terms sound similar, the three have different meanings and can be broken down further.
An outlier is a data point that diverges from the overall pattern in a sample. It is generally considered to lie far outside the norm for a variable or population. Such a value may differ sharply because it reflects a different process or population. Extreme observations encompass outliers as well as data points called fringeliers, which, as the name suggests, hover around the fringes of a distribution.
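The text does not say how "far outside the norm" should be measured. One common convention, used here purely as an assumption, is Tukey's rule: flag any value more than 1.5 interquartile ranges below the first quartile or above the third. The runs-per-game sample below is hypothetical and exists only to show the mechanics.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical runs scored over ten games, with one 20-run game.
runs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 20]
print(tukey_outliers(runs))
```

Other rules, such as a z-score cutoff, would draw the line differently. The choice of rule is a judgment call, not something the definition above fixes.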
Statistical significance is not the same as material significance, or importance.
In statistical hypothesis testing, a result has statistical significance when it would be very unlikely to occur if the null hypothesis were true, that is, when it is larger than we would expect by chance alone. The result is statistically significant, at least in terms of the study, when the calculated p-value, p, is less than the significance level, α. In other words, statistical significance reflects a low probability of observing data at least this extreme by chance alone under the null hypothesis.
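The comparison of p against α can be sketched with an exact one-sided binomial test. The coin-flip scenario, the function names, and the chosen values of α are assumptions for illustration only.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) if the null hypothesis holds."""
    return sum(
        comb(trials, i) * p_null**i * (1 - p_null) ** (trials - i)
        for i in range(successes, trials + 1)
    )


def is_significant(p_value, alpha=0.05):
    """A result is statistically significant when p < alpha."""
    return p_value < alpha


# Hypothetical experiment: 15 heads in 20 flips of a supposedly fair coin.
p = binomial_p_value(15, 20)
print(f"p = {p:.4f}")           # p = 0.0207
print(is_significant(p, 0.05))  # True
print(is_significant(p, 0.01))  # False
```

The same p-value is significant at α = 0.05 but not at α = 0.01, which is why the significance level should be fixed before looking at the data.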