SaberSmart

Data Visualization: Or How I Learned to Stop Rambling and Love The Graph

5/1/2017

Picture
Data can provide valuable insights through statistics and other methods of exploratory data analysis. However, there will inevitably come a time when those findings need to be communicated successfully. Graphs and charts are a great way to communicate data and to draw attention to the stories it can tell. Unfortunately, there are too many instances of graphs misrepresenting data and the conclusions drawn from it.
Luckily, as Rahul Dodhia from Raven Analytics, the author of Misuse of Statistics, states, “statistical literacy is not a skill that is widely accepted as necessary in education. Therefore a lot of misuse of statistics is not intentional, just uninformed”. This means that, with the right knowledge, people can analyze graphics and charts appropriately and determine for themselves whether they are misleading.

In his book The Visual Display of Quantitative Information, Edward Tufte lays out only a few rules for what makes a good graphic, and the one most commonly broken says simply: “Avoid distorting the data”! Perhaps the most blatant case is a graph that gratuitously adds a third dimension when none is needed. The general rule of thumb is to use only as many dimensions as the data contains. Here is one example of this distortion:
Picture

The only dimensions in the data that this graph is trying to show are payments by year and airline carrier, so only two dimensions should be used. Additionally, the striped bar shows only six months' worth of data! That bar should not be included, as it invites a false comparison against the full-year payments and makes it too easy to draw false conclusions.


A second poorly designed graph that drastically distorts the data shows the Fuel Economy Standard for automobiles from 1978 to 1985, in miles per gallon. Here is that image:
Picture

As you can probably see, the graphic tried to be clever by taking the shape of a road. However, drawing a road on a flat piece of paper requires perspective, with the road narrowing to a single point in the distance. When the tick marks in the middle of the road are supposed to depict the data, they inevitably get skewed too. Here the tick for 1978-1979 shows an increase of 1 mpg but is almost 1/100th the size of the tick for 1984-1985, which shows an increase of only ½ mpg!
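One rough way to put a number on this distortion is to compare how much drawn length each tick spends per mile per gallon. The sketch below is illustrative only: it uses the approximate relative sizes quoted above rather than measurements of the image, and the function name `length_per_mpg` is our own invention. An undistorted graphic would give a ratio of 1.

```python
def length_per_mpg(drawn_length, mpg_change):
    """Drawn length used to represent one mpg of change."""
    return drawn_length / mpg_change


# Approximate relative lengths taken from the description above (not measured):
# the 1978-1979 tick is about 1/100th the length of the 1984-1985 tick.
early = length_per_mpg(drawn_length=1.0, mpg_change=1.0)    # 1978-1979, +1 mpg
late = length_per_mpg(drawn_length=100.0, mpg_change=0.5)   # 1984-1985, +0.5 mpg

distortion = late / early
print(f"The later tick uses about {distortion:.0f}x more length per mpg")
```

Under these assumed sizes, the later tick spends roughly 200 times more length on each mpg than the earlier one.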


Poor design, but a good effort at making the chart look like a road. We felt that a bar chart (which is what we think the graphic was going for), without any distortion of course, would be the best visualization of the data. Below is what we came up with. As you can see, without any flair or road background, the data is much more readable and comparable. As Occam's Razor suggests, sometimes less is better!
Picture
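For readers who want to try the same idea without a plotting library, here is a minimal sketch of an undistorted bar chart in plain text. It is not the code behind our chart. The mpg values are an assumption entered from the commonly reproduced figures for this standard, so verify them before reusing them, and the `bar_chart` helper is hypothetical.

```python
# Standards in mpg by year. These values are an assumption (commonly
# reproduced figures); check them against the original data before reuse.
STANDARDS = {1978: 18.0, 1979: 19.0, 1980: 20.0, 1981: 22.0,
             1982: 24.0, 1983: 26.0, 1984: 27.0, 1985: 27.5}


def bar_chart(data, chars_per_unit=2):
    """Return one text bar per entry, with length proportional to the value."""
    lines = []
    for label, value in data.items():
        bar = "#" * round(value * chars_per_unit)
        lines.append(f"{label} | {bar} {value:g}")
    return lines


for line in bar_chart(STANDARDS):
    print(line)
```

Because every bar uses the same number of characters per mpg and starts from zero, equal changes in the data always produce equal changes in bar length.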
All of this finally leads us to what is commonly thought of, at least by Tufte, as “the worst graphic ever to find its way into print.” Now it is easy to understand why!
Picture

In stark contrast, the best statistical graphs and charts communicate complex ideas with clarity, precision, and efficiency. We think this graphic, one of many from Nate Silver’s fivethirtyeight.com, exemplifies this best:
Picture

Even without reading the entire article discussing how few new parents get paid time off, the casual observer understands the discrepancy when compared to paid sick time. The graphic is visually appealing and undistorted, and it lends itself to clear-cut comparisons. The data is taken from a reliable source, the Bureau of Labor Statistics.


Our only qualm with this graphic is the lack of a timeframe. After reading the article, it can be assumed that it uses data from 2014. Additionally, the article notes that over the last fifteen years or so the proportion of all workers getting paid family leave has been increasing at a much faster rate than the proportion getting paid sick time. Even so, this additional context would add little to the graphic, which accurately and efficiently shows the data it wants to depict in an unbiased way.

To wrap things up, when framing a presentation or article it is important to consider different ways of presenting the key features of the data so that the facts are accurately represented. One display by itself may not be sufficient.

What follows are two displays from a study of older drivers' involvement in police-reported crashes and fatal crashes. If you were presented with only the top display, what impression would it create regarding elderly drivers? How might your conclusions change if both displays were presented together?
Picture
Picture

These displays do not present a complete picture of the situation. For example, one question they leave unanswered is the percent fatality rate for each age group. The actual article, however, does provide additional information that can be used to estimate this rate.
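As a sketch of what such an estimate might look like, the snippet below assumes that "percent fatality rate" means fatal crashes as a share of police-reported crashes within an age group, which is our reading rather than a definition from the study. The function name and the counts are hypothetical and are not taken from the article.

```python
def fatality_rate_percent(fatal_crashes, reported_crashes):
    """Percent of police-reported crashes that were fatal."""
    if reported_crashes <= 0:
        raise ValueError("reported_crashes must be positive")
    return 100.0 * fatal_crashes / reported_crashes


# Hypothetical counts for illustration only; NOT taken from the study.
example = {"group A": (50, 10_000), "group B": (30, 2_000)}
for group, (fatal, reported) in example.items():
    print(group, f"{fatality_rate_percent(fatal, reported):.2f}%")
```

In this made-up example, group B has fewer fatal crashes in total but a higher rate per reported crash, which is exactly the kind of distinction a single display of raw counts can hide.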

After studying the situation, A F Williams and O Carsten summarized their conclusions as follows: “The youngest and oldest drivers have the highest crash risk, but the problem lies predominantly in the youngest age groups because elderly drivers have low exposure. The elderly driver problem will increase gradually as their share of the population increases but will remain relatively small. The bulk of the problem will continue to reside among drivers younger than age 65, particularly the youngest drivers.”

In conclusion, how data are presented to inform an audience will depend upon the audience, the medium, and the setting. To be informative, data should be presented with suitable interpretation, either verbal or written, so that the intended message is understood.

The SaberSmart Team