Briefly Summarize the Six Phases of the House of Representatives

Data visualization

Data graphics provide one of the most accessible, compelling, and expressive modes to investigate and depict patterns in information. This affiliate will motivate why well-designed data graphics are of import and describe a taxonomy for understanding their composition. If you are seeing this textile for the get-go time, yous volition never look at data graphics the same way again—yours will shortly be a more than critical lens.

The 2012 federal election cycle

Every four years, the presidential election draws an enormous amount of interest in the United states of america. The most prominent candidates announce their candidacy nearly two years earlier the November elections, first the process of raising the hundreds of millions of dollars necessary to orchestrate a national campaign. In many ways, the experience of running a successful presidential campaign is in itself evidence of the leadership and organizational skills necessary to exist commander-in-chief.

Voices from all parts of the political spectrum are critical of the influence of money upon political campaigns. While the contributions from individual citizens to private candidates are express in various ways, the Supreme Court'due south decision in Citizens United five. Federal Election Commission allows unlimited political spending by corporations (non-turn a profit or otherwise). This has resulted in a system of committees (nearly notably, political activeness committees, PACs) that can accept unlimited contributions and spend them on behalf of (or against) a item candidate or prepare of candidates. Unraveling the complicated network of campaign spending is a subject area of bully interest.

To perform that unraveling is an exercise in information science. The Federal Election Commission (FEC) maintains a website with logs of not only all of the ($200 or more) contributions made by individuals to candidates and committees, simply likewise of spending by committees on behalf of (and confronting) candidates. Of grade, the FEC also maintains data on which candidates win elections, and by how much. These data sources are separate, and it requires some ingenuity to piece them together. We volition develop these skills in Chapters 4–6, but for at present, we will focus on graphical displays of the data that tin be gleaned from these information. Our emphasis at this phase is on making intelligent decisions near how to brandish certain information, then that a articulate (and correct) message is delivered.

Among the most bones questions is: How much money did each candidate raise? However, the convoluted campaign finance network makes even this simple question difficult to answer, and—mayhap more than importantly—less meaningful than nosotros might think. A better question is: On whose candidacy was the about money spent? In Figure 2.1, we prove a bar graph of the amount of coin (in millions of dollars) that were spent by committees on item candidates during the general ballot phase of the 2012 federal election cycle. This includes candidates for president, the United States Senate, and the United States House of Representatives. Only candidates on whose campaign at to the lowest degree $4 million was spent are included in Effigy 2.1.

$Amount of money spent on individual candidates in the general election phase of the 2012 federal election cycle, in millions of dollars. Candidacies with at least \$4 million in spending are depicted.$

Figure 2.1: Amount of money spent on individual candidates in the general ballot phase of the 2012 federal ballot bicycle, in millions of dollars. Candidacies with at to the lowest degree $four million in spending are depicted.

It seems clear from Figure 2.1 that President Barack Obama's re-ballot campaign spent far more money than any other candidate, in particular more than doubling the corporeality of coin spent past his Republican challenger, Mitt Romney. Withal, committees are not limited to spending money in support of a candidate—they tin can also spend money confronting a particular candidate (attack ads). In Effigy ii.2, we separate the same spending shown in Figure 2.ane by whether the money was spent for or against the candidate.

$Amount of money spent on individual candidates in the general election phase of the 2012 federal election cycle, in millions of dollars, broken down by type of spending. Candidacies with at least \$4 million in spending are depicted.$

Figure 2.2: Amount of coin spent on individual candidates in the general election phase of the 2012 federal election bike, in millions of dollars, broken downwardly by type of spending. Candidacies with at least $4 million in spending are depicted.

In these elections, nearly of the money was spent against each candidate, and in item, $251 one thousand thousand of the $274 meg spent on President Obama's entrada was spent against his candidacy. Similarly, most of the money spent on Hand Romney's campaign was confronting him, merely the percentage of negative spending on Romney's campaign (lxx%) was lower than that of Obama (92%).

The difference between Effigy 2.1 and Figure two.2 is that in the latter we have used colour to bring a third variable (type of spending) into the plot. This allows united states of america to make a clear comparing that importantly changes the conclusions nosotros might depict from the former plot. In particular, Figure 2.1 makes it appear as though President Obama'south state of war chest dwarfed that of Romney, when in fact the contrary was true.

Are these 2 groups different?

Since so much more money was spent attacking Obama's campaign than Romney'due south, you might conclude from Effigy 2.2 that Republicans were more successful in fundraising during this election cycle. In Figure ii.3, nosotros can confirm that this was indeed the case, since more money was spent supporting Republican candidates than Democrats, and more than money was spent attacking Democratic candidates than Republican. It also seems clear from Figure two.iii that nearly all of the money was spent on either Democrats or Republicans.²

Amount of money spent on individual candidacies by political party affiliation during the general election phase of the 2012 federal election cycle.

Figure 2.3: Amount of money spent on individual candidacies by political party affiliation during the general election phase of the 2012 federal election cycle.

Nonetheless, the question of whether the money spent on candidates really differed past party amalgamation is a bit thornier. Equally we saw above, the presidential election dominated the political donations in this election bicycle. Romney faced a serious disadvantage in trying to unseat an incumbent president. In this example, the function being sought is a confounding variable. Past further subdividing the contributions in Figure 2.3 by the function being sought, we tin see in Figure two.4 that while more money was spent supporting Republican candidates for all elective branches of government, information technology was only in the presidential election that more than coin was spent attacking Democratic candidates. In fact, slightly more money was spent attacking Republican Business firm and Senate candidates.

Amount of money spent on individual candidacies by political party affiliation during the general election phase of the 2012 federal election cycle, broken down by office being sought (House, President, or Senate).

Figure ii.4: Amount of money spent on individual candidacies by political party affiliation during the general election phase of the 2012 federal election bike, broken down past office being sought (House, President, or Senate).

Annotation that Figures 2.3 and 2.4 display the same information. In Figure ii.4, we have an additional variable that provides an important inkling into the mystery of campaign finance. Our choice to include that variable results in Figure 2.iv conveying substantially more than meaning than Figure 2.iii, even though both figures are "right." In this chapter, we volition begin to develop a framework for creating principled data graphics.

Graphing variation

Ane theme that arose during the presidential election was the accusation that Romney'due south campaign was supported by a few rich donors, whereas Obama'south support came from people across the economic spectrum. If this were truthful, then we would await to meet a departure in the distribution of donation amounts between the two candidates. In item, we would expect to meet this in the histograms shown in Effigy 2.five, which summarize the more than one one thousand thousand donations made past individuals to the two major committees that supported each candidate (for Obama, Obama for America, and the Obama Victory Fund 2012; for Romney, Romney for President, and Romney Victory 2012). We do meet some evidence for this merits in Figure 2.5, Obama did appear to receive more smaller donations, merely the evidence is far from conclusive. One problem is that both candidates received many small donations but just a few larger donations; the scale on the horizontal centrality makes it difficult to actually see what is going on. Secondly, the histograms are hard to compare in a side-by-side placement. Finally, we have lumped all of the donations from both phases of the presidential election (i.e., primary vs. general) in together.

Donations made by individuals to the PACs supporting the two major presidential candidates in the 2012 election.

Figure 2.5: Donations made by individuals to the PACs supporting the two major presidential candidates in the 2012 election.

In Figure 2.half dozen, we remedy these issues past (1) using density curves instead of histograms, and so that we can compare the distributions directly, (2) plotting the logarithm of the donation amount on the horizontal scale to focus on the information that are of import, and (three) separating the donations by the phase of the election. Figure 2.6 allows us to brand more nuanced conclusions. The right console supports the allegation that Obama'due south donations came from a broader base during the main election phase. It does announced that more of Obama's donations came in smaller amounts during this stage of the ballot. Yet, in the general phase, there is virtually no difference in the distribution of donations made to either campaign.

Donations made by individuals to the PACs supporting the two major presidential candidates in the 2012 election, separated by election phase.

Figure 2.6: Donations made by individuals to the PACs supporting the two major presidential candidates in the 2012 ballot, separated past ballot phase.

Examining relationships amongst variables

Naturally, the biggest questions raised past the Citizens United decision are about the influence of money in elections. If campaign spending is unlimited, does this mean that the candidate who generates the most spending on their behalf will earn the most votes? One way that nosotros might accost this question is to compare the corporeality of money spent on each candidate in each ballot with the number of votes that candidate earned. Statisticians volition want to know the correlation between these 2 quantities—when i is loftier, is the other one likely to be high as well?

Since all 435 members of the Usa House of Representatives are elected every 2 years, and the districts incorporate roughly the same number of people, House elections provide a nice data ready to make this blazon of comparison. In Effigy 2.7, we show a simple scatterplot relating the number of dollars spent on behalf of the Autonomous candidate against the number of votes that candidate earned for each of the Firm elections.

Scatterplot illustrating the relationship between number of dollars spent supporting and number of votes earned by Democrats in 2012 elections for the House of Representatives.

Figure ii.7: Scatterplot illustrating the relationship between number of dollars spent supporting and number of votes earned by Democrats in 2012 elections for the Firm of Representatives.

The human relationship between the two quantities depicted in Figure 2.7 is very weak. It does not appear that candidates who benefited more from campaign spending earned more votes. Withal, the comparison in Figure 2.vii is misleading. On both axes, information technology is not the amount that is important, but the proportion. Although the population of each congressional district is similar, they are not the same, and voter turnout volition vary based on a diverseness of factors. Past comparing the proportion of the vote, we can command for the size of the voting population in each district. Similarly, it makes less sense to focus on the total corporeality of money spent, every bit opposed to the proportion of coin spent. In Figure 2.eight, we nowadays the same comparing, but with both axes scaled to proportions.

Scatterplot illustrating the relationship between proportion of dollars spent supporting and proportion of votes earned by Democrats in the 2012 House of Representatives elections. Each dot represents one district. The size of each dot is proportional to the total spending in that election, and the alpha transparency of each dot is proportional to the total number of votes in that district.

Figure 2.8: Scatterplot illustrating the human relationship between proportion of dollars spent supporting and proportion of votes earned by Democrats in the 2012 House of Representatives elections. Each dot represents 1 commune. The size of each dot is proportional to the total spending in that election, and the alpha transparency of each dot is proportional to the total number of votes in that district.

Effigy 2.8 captures many nuances that were impossible to see in Effigy 2.7. First, there does appear to be a positive association between the percentage of money supporting a candidate and the percentage of votes that they earn. However, that relationship is of greatest interest towards the middle of the plot, where elections are actually contested. Outside of this region, i candidate wins more 55% of the vote. In this example, there is usually very picayune money spent. These are considered "safe" Firm elections—you lot can run into these points on the plot considering most of them are close to $x=0$ or $ten=1$, and the dots are very small. For example, one of the points in the lower-left corner is the 8th district in Ohio, which was won by the so Speaker of the House John Boehner, who ran unopposed. The election in which the most money was spent (over $11 million) was also in Ohio. In the 16th district, Republican incumbent Jim Renacci narrowly defeated Democratic challenger Betty Sutton, who was herself an incumbent from the 13th district. This battle was made possible through decennial redistricting (see Affiliate 17). Of the money spent in this election, 51.two% was in back up of Sutton just she earned only 48.0% of the votes.

In the center of the plot, the dots are bigger, indicating that more coin is beingness spent on these contested elections. Of course this makes sense, since candidates who are fighting for their political lives are more than likely to fundraise aggressively. Nevertheless, the evidence that more fiscal support correlates with more votes in contested elections is relatively weak.

Networks

Not all relationships among variables are sensibly expressed by a scatterplot. Some other way in which variables can be related is in the course of a network (nosotros will hash out these in more detail in Chapter 20). In this case, campaign funding has a network structure in which individuals donate money to committees, and committees then spend money on behalf of candidates. While the national campaign funding network is far as well circuitous to show here, in Figure 2.9 we display the funding network for candidates from Massachusetts.

In Figure 2.ix, nosotros run across that the two campaigns that benefited the most from commission spending were Republicans Mitt Romney and Scott Brown. This is not surprising, since Romney was running for president and received massive donations from the Republican National Committee, while Brown was running to keep his Senate seat in a heavily Autonomous land against a potent challenger, Elizabeth Warren. Both men lost their elections. The constellation of bluish dots are the congressional delegation from Massachusetts, all of whom are Democrats.

Campaign funding network for candidates from Massachusetts, 2012 federal elections. Each edge represents a contribution from a PAC to a candidate.

Effigy 2.9: Entrada funding network for candidates from Massachusetts, 2012 federal elections. Each edge represents a contribution from a PAC to a candidate.

Composing data graphics

Former New York Times intern and FlowingData.com creator Nathan Yau makes the illustration that creating data graphics is like cooking: Anyone can learn to type graphical commands and generate plots on the computer. Similarly, anyone can heat up nutrient in a microwave. What separates a high-quality visualization from a apparently one are the same elements that separate keen chefs from novices: mastery of their tools, knowledge of their ingredients, insight, and creativity (Yau 2013). In this section, we present a framework—rooted in scientific research—for understanding information graphics. Our promise is that by internalizing these ideas you will refine your data graphics palette.

A taxonomy for data graphics

The taxonomy presented in Yau (2013) provides a systematic style of thinking about how information graphics convey specific pieces of information and how they could be improved. A complementary grammar of graphics (Wilkinson et al. 2005) is implemented past Hadley Wickham in the ggplot2 graphics package (Hadley Wickham 2016), albeit using slightly different terminology. For clarity, we will postpone discussion of ggplot2 until Chapter 3. (To extend our cooking analogy, you must learn to taste before you can larn to cook well.)

In this framework, data graphics tin can be understood in terms of iv basic elements: visual cues, coordinate systems, scale, and context. In what follows, nosotros explain this vision and append a few additional items (facets and layers). This section should equip the careful reader with the ability to systematically break down data graphics, enabling a more than critical assay of their content.

Visual Cues

Visual cues are graphical elements that depict the eye to what you want your audience to focus upon. They are the central edifice blocks of information graphics, and the choice of which visual cues to apply to represent which quantities is the central question for the data graphic composer. Table 2.1 identifies nine distinct visual cues, for which we also list whether that cue is used to encode a numerical or categorical quantity:

Tabular array 2.1: Visual cues and what they signify.
Visual Cue	Variable Type	Question
Position	numerical	where in relation to other things?
Length	numerical	how big (in one dimension)?
Angle	numerical	how wide? parallel to something else?
Direction	numerical	at what slope? in a fourth dimension serial, going up or downward?
Shape	chiselled	belonging to which grouping?
Area	numerical	how big (in 2 dimensions)?
Volume	numerical	how big (in three dimensions)?
Shade	either	to what extent? how severely?
Color	either	to what extent? how severely?

Research into graphical perception (dating back to the mid-1980s) has shown that man beings' ability to perceive differences in magnitude accurately descends in this order (Cleveland and McGill 1984). That is, humans are quite skillful at accurately perceiving differences in position (e.one thousand., how much taller one bar is than another), but non every bit skilful at perceiving differences in angles. This is i reason why many people prefer bar charts to pie charts. Our relatively poor ability to perceive differences in color is a major factor in the relatively low stance of heat maps that many data scientists take.

Coordinate systems

How are the data points organized? While any number of coordinate systems are possible, three are most common:

Cartesian: The familiar $(x,y)$-rectangular coordinate system with two perpendicular axes.
Polar: The radial analog of the Cartesian system with points identified by their radius $\rho$ and angle $\theta$.
Geographic: The increasingly important system in which we have locations on the curved surface of the Earth, but nosotros are trying to represent these locations in a flat two-dimensional airplane. We volition discuss such geospatial analyses in Affiliate 17.

An advisable option for a coordinate system is critical in representing one'due south data accurately, since, for example, displaying geospatial data like airline routes on a flat Cartesian aeroplane tin can atomic number 82 to gross distortions of reality (come across Department 17.3.2).

Scale

Scales translate values into visual cues. The option of scale is ofttimes crucial. The fundamental question is how does distance in the data graphic interpret into meaningful differences in quantity? Each coordinate axis can accept its own calibration, for which we have three dissimilar choices:

Numeric: A numeric quantity is most commonly set on a linear, logarithmic, or per centum scale. Note that a logarithmic scale does non have the property that, say, a one-centimeter difference in position corresponds to an equal difference in quantity anywhere on the scale.
Categorical: A categorical variable may take no ordering (east.g., Democrat, Republican, or Independent), or it may exist ordinal (e.chiliad., never, former, or current smoker).
Time: A numeric quantity that has some special properties. First, because of the calendar, information technology can be demarcated past a serial of different units (e.thousand., twelvemonth, calendar month, twenty-four hours, etc.). 2d, it can be considered periodically (or cyclically) as a "wrap-effectually" calibration. Fourth dimension is also and so unremarkably used and misused that information technology warrants conscientious consideration.

Misleading with scale is easy, since information technology has the potential to completely distort the relative positions of data points in whatever graphic.

Context

The purpose of data graphics is to assist the viewer brand meaningful comparisons, but a bad information graphic tin can practise just the reverse: It tin can instead focus the viewer's attention on meaningless artifacts, or ignore crucial pieces of relevant only external knowledge. Context can be added to data graphics in the class of titles or subtitles that explain what is beingness shown, axis labels that make it clear how units and scale are depicted, or reference points or lines that contribute relevant external data. While 1 should avert cluttering upwardly a data graphic with excessive annotations, it is necessary to provide proper context.

Minor multiples and layers

1 of the central challenges of creating data graphics is condensing multivariate information into a two-dimensional image. While three-dimensional images are occasionally useful, they are ofttimes more than confusing than annihilation else. Instead, hither are three common means of incorporating more than variables into a 2-dimensional data graphic:

Pocket-sized multiples: Also known as facets, a single information graphic tin be equanimous of several small multiples of the same basic plot, with ane (discrete) variable irresolute in each of the modest sub-images.
Layers: It is sometimes advisable to depict a new layer on top of an existing data graphic. This new layer can provide context or comparison, merely there is a limit to how many layers humans can reliably parse.
Blitheness: If time is the additional variable, then an animation can sometimes effectively convey changes in that variable. Of course, this doesn't work on the printed page and makes it impossible for the user to encounter all the data at once.

Color

Color is i of the flashiest, but most misperceived and misused visual cues. In making color choices, at that place are a few fundamental ideas that are important for whatsoever data scientist to understand.

Commencement, as we saw higher up, color and its monochromatic cousin shade are two of the almost poorly perceived visual cues. Thus, while potentially useful for a modest number of levels of a categorical variable, color and shade are not particularly faithful ways to represent numerical variables—especially if small differences in those quantities are of import to distinguish. This means that while colour can be visually appealing to humans, information technology often isn't as informative as we might promise. For two numeric variables, it is difficult to think of examples where color and shade would be more than useful than position. Where colour can be most constructive is to represent a third or fourth numeric quantity on a scatterplot—one time the two position cues have been exhausted.

Second, approximately 8% of the population—most of whom are men—accept some form of color blindness. Most ordinarily, this renders them incapable of seeing colors accurately, most notably of distinguishing between ruby and green. Compounding the problem, many of these people do not know that they are color-blind. Thus, for professional graphics information technology is worth thinking carefully about which colors to utilise. The National Football League famously failed to account for this in a 2022 game in which the Buffalo Bills wore all-ruddy jerseys and the New York Jets wore all-dark-green, leaving colorblind fans unable to distinguish i team from the other!

To forestall issues with color blindness, avert contrasting red with light-green in data graphics. As a bonus, your plots won't seem Christmas-y!

Thankfully, we have been freed from the burden of having to create such intelligent palettes by the research of Cynthia Brewer, creator of the ColorBrewer website (and inspiration for the RColorBrewer R package). Brewer has created colorblind-rubber palettes in a variety of hues for three different types of numeric information in a single variable:

Sequential: The ordering of the data has but 1 direction. Positive integers are sequential considering they can just go up: they can't become past 0. (Thus, if 0 is encoded equally white, then any darker shade of gray indicates a larger number.)
Diverging: The ordering of the data has two directions. In an election forecast, we unremarkably see states colored based on how they are expected to vote for the president. Since scarlet is associated with Republicans and blueish with Democrats, states that are solidly ruddy or blue are on opposite ends of the scale. But "swing states" that could go either manner may appear purple, white, or some other neutral color that is "between" cherry and blueish (see Figure two.10).
Qualitative: There is no ordering of the information, and we simply need color to differentiate different categories.

Diverging red-blue color palette.

Figure 2.10: Diverging red-blue color palette.

The RColorBrewer bundle provides functionality to employ these palettes directly in R. Effigy 2.eleven illustrates the sequential, qualitative, and diverging palettes built into RColorBrewer.

Palettes available through the RColorBrewer package.

Figure 2.11: Palettes available through the RColorBrewer package.

Accept the actress time to use a well-designed color palette. Have that those who piece of work with color for a living will probably cull better colors than you lot.

Other fantabulous perceptually distinct color palettes are provided by the viridis package. These palettes mimic those that are used in the matplotlib plotting library for Python. The viridis palettes are also accessible in ggplot2 through, for example, the scale_color_viridis() function.

Dissecting information graphics

With a little exercise, i can larn to dissect information graphics in terms of the taxonomy outlined in a higher place. For example, your basic scatterplot uses position in the Cartesian airplane with linear scales to show the relationship between two variables. In what follows, we identify the visual cues, coordinate system, and calibration in a serial of simple data graphics.

The bar graph in Figure ii.12 displays the average score on the math portion of the 1994–1995 Sat (with possible scores ranging from 200 to 800) among states for whom at least two-thirds of the students took the Sat.

Bar graph of average SAT scores among states with at least two-thirds of students taking the test.

Effigy two.12: Bar graph of boilerplate SAT scores among states with at least two-thirds of students taking the examination.

This plot uses the visual cue of length to correspond the math Saturday score on the vertical centrality with a linear scale. The categorical variable of state is arrayed on the horizontal axis. Although the states are ordered alphabetically, information technology would not be appropriate to consider the state variable to be ordinal, since the ordering is non meaningful in the context of math SAT scores. The coordinate organisation is Cartesian, although equally noted previously, the horizontal coordinate is meaningless. Context is provided by the centrality labels and championship. Note likewise that since 200 is the minimum score possible on each department of the SAT, the vertical centrality has been constrained to start at 200.

Next, nosotros consider a fourth dimension serial that shows the progression of the world record times in the 100-meter freestyle swimming event for men and women. Figure two.13 displays the times equally a function of the year in which the new tape was gear up.

Scatterplot of world record time in 100-m freestyle swimming.

Figure 2.thirteen: Scatterplot of world record fourth dimension in 100-1000 freestyle swimming.

At some level this is only a scatterplot that uses position on both the vertical and horizontal axes to indicate swimming time and chronological time, respectively, in a Cartesian plane. The numeric scale on the vertical centrality is linear, in units of seconds, while the scale on the horizontal centrality is also linear, measured in years. Only there is more going on hither. Colour is being used every bit a visual cue to distinguish the categorical variable sexual activity. Furthermore, since the points are connected by lines, management is being used to betoken the progression of the tape times. (In this case, the records can only get faster, then the direction is e'er down.) 1 might even argue that angle is being used to compare the descent of the world records across fourth dimension and/or gender. In fact, in this case shape is also being used to distinguish sex.

Next, we present 2 pie charts in Figure 2.xiv indicating the different substance of abuse for subjects in the Health Evaluation and Linkage to Chief Care (Assist) clinical trial ("Linking Alcohol and Drug Dependent Adults to Primary Medicalcare: Aa Randomized Controlled Trial of a Multidisciplinaryhealth Intervention in a Detoxification Unit of measurement" 2003). Each bailiwick was identified with involvement with one chief substance (alcohol, cocaine, or heroin). On the right, we meet the distribution of substance for housed (no nights in shelter or on the street) participants is fairly evenly distributed, while on the left, we run into the same distribution for those who were homeless one or more nights (more likely to have alcohol every bit their principal substance of abuse).

Pie charts showing the breakdown of substance of abuse among HELP study participants, faceted by homeless status. Compare this to Figure 3.13.

Figure 2.14: Pie charts showing the breakdown of substance of abuse amidst Help study participants, faceted by homeless condition. Compare this to Figure iii.13.

This graphic uses a radial coordinate arrangement and the visual cue of color to distinguish the iii levels of the chiselled variable substance. The visual cue of angle is existence used to quantify the differences in the proportion of patients using each substance. Are you able to accurately identify these percentages from the effigy? The bodily percentages are shown as follows.

              # A tibble: 3 × 3   substance Homeless        Housed           <fct>     <chr>           <chr>          1 booze   due north = 103 (49.3%) north = 74 (xxx.3%) 2 cocaine   n = 59 (28.2%)  n = 93 (38.ane%) iii heroin    n = 47 (22.5%)  n = 77 (31.6%)

This is a case where a unproblematic tabular array of these proportions is more effective at communicating the true differences than this—and probably any—data graphic. Note that there are only six information points presented, so any graphic is probably gratuitous.

Don't apply pie charts, except perhaps in small-scale multiples.

Finally, in Effigy ii.xv we nowadays a choropleth map showing the population of Massachusetts by the 2010 Census tracts.

Choropleth map of population among Massachusetts Census tracts, based on 2022 American Community Survey.

Figure ii.15: Choropleth map of population amidst Massachusetts Census tracts, based on 2022 American Community Survey.

Clearly, we are using a geographic coordinate system here, with latitude and longitude on the vertical and horizontal axes, respectively. (This plot is not projected: More information about projection systems is provided in Affiliate 17.) Shade is once over again being used to represent the quantity population, only here the scale is more complicated. The ten shades of blue accept been mapped to the deciles of the census tract populations, and since the distribution of population beyond these tracts is right-skewed, each shade does not correspond to a range of people of the aforementioned width, merely rather to the same number of tracts that have a population in that range. Helpful context is provided by the title, subtitle, and legend.

Importance of data graphics: Challenger

On January 27th, 1986, engineers at Morton Thiokol, who supplied solid rocket motors (SRMs) to NASA for the space shuttle, recommended that NASA filibuster the launch of the space shuttle Challenger due to concerns that the cold weather forecast for the next day'southward launch would jeopardize the stability of the condom O-rings that held the rockets together. These engineers provided 13 charts that were reviewed over a two-hour conference call involving the engineers, their managers, and NASA. The engineers' recommendation was overruled due to a lack of persuasive evidence, and the launch proceeded on schedule. The O-rings failed in exactly the manner the engineers had feared 73 seconds after launch, Challenger exploded, and all seven astronauts on lath died (Tufte 1997).

In addition to the tragic loss of life, the incident was a devastating blow to NASA and the U.s. space program. The manus-wringing that followed included a two-and-a-one-half twelvemonth hiatus for NASA and the formation of the Rogers Commission to study the disaster. What became clear is that the Morton Thiokol engineers had correctly identified the key causal link between temperature and O-ring harm. They did this using statistical information assay combined with a plausible physical explanation: in short, that the safety O-rings became brittle in low temperatures. (This link was famously demonstrated past legendary physicist and Rogers Commission member Richard Feynman during the hearings, using a glass of water and some ice cubes (Tufte 1997).) Thus, the engineers were able to place the disquisitional weakness using their domain knowledge—in this instance, rocket science—and their data analysis.

Their failure—and its horrific consequences—was one of persuasion: They only did non nowadays their evidence in a convincing manner to the NASA officials who ultimately made the decision to go along with the launch. More than 30 years later this tragedy remains critically important. The show brought to the discussions about whether to launch was in the form of hand written information tables (or "charts"), only none were graphical. In his sweeping critique of the incident, Edward Tufte created a powerful scatterplot similar to Figures 2.16 and two.17, which were derived from data that the engineers had at the time, but in a far more effective presentation (Tufte 1997).

A scatterplot with smoother demonstrating the relationship between temperature and O-ring damage on solid rocket motors. The dots are semi-transparent, so that darker dots indicate multiple observations with the same values.

Figure ii.16: A scatterplot with smoother demonstrating the relationship between temperature and O-ring impairment on solid rocket motors. The dots are semi-transparent, so that darker dots indicate multiple observations with the same values.

Figure ii.16 indicates a clear relationship betwixt the ambient temperature and O-ring damage on the solid rocket motors. To demonstrate the dramatic extrapolation made to the predicted temperature on January 27th, 1986, Tufte extended the horizontal axis in his scatterplot (Effigy 2.17) to include the forecast temperature. The huge gap makes obviously the problem with extrapolation. Reprints of 2 Morton Thiokol data graphics are shown in Figures 2.18 and ii.19 (Tufte 1997).

A recreation of Tufte's scatterplot demonstrating the relationship between temperature and O-ring damage on solid rocket motors.

Effigy 2.17: A recreation of Tufte'due south scatterplot demonstrating the human relationship between temperature and O-ring damage on solid rocket motors.

One of the original 13 charts presented by Morton Thiokol engineers to NASA on the conference call the night before the Challenger launch. This is one of the more data-intensive charts.

Figure 2.xviii: One of the original 13 charts presented by Morton Thiokol engineers to NASA on the conference call the night earlier the Challenger launch. This is one of the more data-intensive charts.

Evidence presented during the congressional hearings after the Challenger explosion.

Figure two.19: Bear witness presented during the congressional hearings afterwards the Challenger explosion.

Tufte provides a full critique of the engineers' failures (Tufte 1997), many of which are instructive for data scientists.

Lack of authorship: There were no names on any of the charts. This creates a lack of accountability. No single person was willing to take responsibility for the data contained in any of the charts. Information technology is much easier to refute an argument made by a group of nameless people, than to a single or grouping of named people.
Univariate analysis: The engineers provided several data tables, simply all were substantially univariate. That is, they presented information on a single variable, merely did non illustrate the relationship betwixt two variables. Note that while Figure two.18 does testify data for two different variables, it is very difficult to see the connection betwixt the ii in tabular course. Since the crucial connexion hither was between temperature and O-ring damage, this lack of bivariate assay was probably the single most damaging omission in the engineers' presentation.
Anecdotal evidence: With such a small sample size, anecdotal evidence can be peculiarly challenging to abnegate. In this case, a artificial comparison was fabricated based on 2 observations. While the engineers argued that SRM-15 had the most damage on the coldest previous launch appointment (meet Figure 2.17), NASA officials were able to counter that SRM-22 had the 2nd-most damage on one of the warmer launch dates. These anecdotal pieces of evidence fall apart when all of the data are considered in context—in Figure 2.17, it is clear that SRM-22 is an outlier that deviates from the general pattern—merely the engineers never presented all of the data in context.
Omitted data: For some reason, the engineers chose not to present information from 22 other flights, which collectively represented 92% of launches. This may have been due to time constraints. This dramatic reduction in the accumulated evidence played a part in enabling the anecdotal evidence outlined above.
Confusion: No doubt working against the clock, and about probable working in tandem, the engineers were not always articulate well-nigh 2 different types of damage: erosion and accident-past. A failure to clearly ascertain these terms may have hindered agreement on the part of NASA officials.
Extrapolation: Most forcefully, the failure to include a simple scatterplot of the full information obscured the "stupendous extrapolation" (Tufte 1997) necessary to justify the launch. The lesser line was that the forecast launch temperature (between 26 and 29 degrees Fahrenheit) was then much colder than anything that had occurred previously, whatsoever model for O-ring damage as a function of temperature would be untested.

When more than a handful of observations are present, data graphics are often more than revealing than tables. Always consider culling representations to improve communication.

Tufte notes that the fundamental sin of the engineers was a failure to frame the information in relation to what? The notion that sure information may be understood in relation to something is perhaps the fundamental and defining characteristic of statistical reasoning. We will follow this thread throughout the book.

Always ensure that graphical displays are clearly described with appropriate axis labels, additional text descriptions, and a caption.

We present this tragic episode in this affiliate equally motivation for a careful written report of data visualization. Information technology illustrates a disquisitional truism for practicing data scientists: Being right isn't enough—you have to be disarming. Note that Figure two.nineteen contains the same data that are present in Figure two.17 but in a far less suggestive format. It just so happens that for most human beings, graphical explanations are peculiarly persuasive. Thus, to be a successful data analyst, one must main at least the nuts of data visualization.

Creating constructive presentations

Giving effective presentations is an of import skill for a information scientist. Whether these presentations are in academic conferences, in a classroom, in a boardroom, or even on stage, the ability to communicate to an audience is of immeasurable value. While some people may exist naturally more than comfortable in the limelight, everyone can improve the quality of their presentations.

A few pieces of full general advice are warranted (Ludwig 2012):

Upkeep your time: Oftentimes you will just have a few minutes to speak and usually a few boosted minutes to answer questions. If your talk runs too short or also long, information technology makes you lot seem unprepared. Rehearse your talk several times in order to get a better feel for your timing. Note also that you may have a tendency to talk faster during your actual talk than y'all will during your rehearsal. Talking faster in order to speed upwards is a bad strategy—you are much improve off simply cutting material ahead of time. You will probably have a hard time getting through $x$ slides in $10$ minutes.

Talking faster in order to speed up is not a expert strategy—y'all are much better off simply cutting cloth ahead of time or moving to a primal slide or determination.

Don't write too much on each slide: You don't want people to have to read your slides, because if the audience is reading your slides, so they aren't listening to you. You want your slides to provide visual cues to the points that you are making—non substitute for your spoken words. Concentrate on graphical displays and bullet-pointed lists of ideas.
Put your trouble in context: Remember that (in almost cases) most of your audition volition have picayune or no knowledge of your subject matter. The easiest fashion to lose people is to dive correct into technical details that require prior domain knowledge. Spend a few minutes at the beginning of your talk introducing your audience to the nigh basic aspects of your topic and presenting some motivation for what you are studying.
Speak loudly and clearly: Remember that (in most cases) you know more than well-nigh your topic that anyone else in the room, so speak and deed with confidence!
Tell a story, but not necessarily the whole story: Information technology is unrealistic to expect that yous can tell your audience everything that you know about your topic in $x$ minutes. You should strive to convey the big ideas in a clear fashion only not dwell on the details. Your talk will be successful if your audience is able to walk away with an understanding of what your research question was, how yous addressed it, and what the implications of your findings are.

The wider world of data visualization

Thus far our word of data visualization has been limited to static, ii-dimensional information graphics. However, there are many additional means to visualize data. While Chapter 3 focuses on static information graphics, Chapter 14 presents several cutting-edge tools for making interactive information visualizations. Even more broadly, the field of visual analytics is concerned with the science behind building interactive visual interfaces that raise one'due south ability to reason about data.

Finally, nosotros have data art. You can exercise many things with data. On one end of the spectrum, you might be focused on predicting the result of a specific response variable. In such cases, your goal is very well-divers and your success can be quantified. On the other end of the spectrum are projects chosen data art, wherein the meaning of what you are doing with the data is elusive, but the experience of viewing the data in a new way is in itself meaningful.

Consider Memo Akten and Quayola's Forms, which was inspired by the physical movement of athletes in the Commonwealth Games. Through video assay, these movements were translated into three-dimensional digital objects shown in Effigy ii.20. Note how the image in the upper-left is evocative of a swimmer surfacing after a swoop. When viewed as a movie, Forms is an arresting case of data fine art.

Successful data art projects require both creative talent and technical ability. Before The states is the Salesman'south House is a live, continuously-updating exploration of the online market eBay. This installation was created past statistician Mark Hansen and digital artist Jer Thorpe and is projected on a big screen as you lot enter eBay'south campus.

The display begins by pulling up Arthur Miller's archetype play Death of a Salesman, and "reading" the text of the beginning chapter. Along the way, several nouns are plucked from the text (due east.grand., flute, refrigerator, chair, bed, bays, etc.). For each in succession, the display then shifts to a geographic display of where things with that noun in the description are currently being sold on eBay, replete with price and auction data. (Note that these descriptions are not ever perfect. In the video, a search for "refrigerator" turns upward a T-shirt of one-time Chicago Bears defensive cease William [Fridge] Perry.)

Adjacent, one city where such an particular is being sold is chosen, and any classic books of American literature beingness sold nearby are collected. I is chosen, and the bike returns to the beginning past "reading" the first folio of that volume. This process continues indefinitely. When describing the exhibit, Hansen spoke of "ane data set reading some other." It is this interplay of data and literature that makes such data art projects and so powerful.

Finally, we consider another Mark Hansen collaboration, this time with Ben Rubin and Michele Gorman. In Shakespeare Machine, 37 digital LCD blades—each corresponding to one of Shakespeare'south plays—are arrayed in a circle. The display on each blade is a pattern of words culled from the text of these plays. First, pairs of hyphenated words are shown. Side by side, Boolean pairs (east.g., "skillful or bad") are found. Third, manufactures and adjectives modifying nouns (eastward.m., "the holy father"). In this manner, the artistic masterpieces of Shakespeare are shattered into formulaic chunks. In Affiliate 19, we volition learn how to utilise regular expressions to find the information for Shakespeare Machine.

Farther resource

While issues related to data visualization pervade this entire text, they will be the particular focus of Chapters 3 (Data visualization II), xiv (Information visualization Iii), and 17 (Geospatial data).

No education in data graphics is complete without reading Tufte's Visual Display of Quantitative Data (Tufte 2001), which also contains a clarification of John Snow's cholera map (see Chapter 17). For a full description of the Challenger incident, see (Tufte 1997). Tufte has also published two other landmark books (Tufte 1990, 2006), as well as reasoned polemics about the shortcomings of PowerPoint (Tufte 2003). Cleveland and McGill (1984) provide the foundation for Yau'southward taxonomy (Yau 2013). Yau (2011) provides many examples of idea-provoking data visualizations, peculiarly data fine art. The grammar of graphics was first described by Wilkinson et al. (2005). Hadley Wickham (2016) implemented ggplot2 based on this formulation.

Many important data graphics were developed by Tukey (1990). A. Gelman, Pasarica, and Dodhia (2002) take also written persuasively near data graphics in statistical journals. Gelman discusses a set of approved data graphics too as Tufte's suggested modifications to them. D. Nolan and Perrett (2016) hash out data visualization assignments and rubrics that tin can be used to grade them. Steven J. Murdoch has created some R functions for drawing the kind of modified diagrams described in Tufte (2001). These also announced in the ggthemes package (Arnold 2019).

Cynthia Brewer's color palettes are available at http://colorbrewer2.org and through the RColorBrewer package. Her work is described in more than detail in Brewer (1994) and Brewer (1999). The viridis (Garnier 2021a) and viridisLite (Garnier 2021b) packages provide matplotlib-like palettes for R. Ram and Wickham (2018) created the whimsical color palette that evokes Wes Anderson'south distinctive movies. Technically Speaking is an NSF-funded project for presentation advice that contains instructional videos for students (Ludwig 2012).

Exercises

Problem 1 (Easy): Consider the post-obit data graphic.

The am variable takes the value 0 if the car has automated transmission and 1 if the car has transmission transmission. How could yous differentiate the cars in the graphic based on their transmission type?

Problem 2 (Medium): Pick one of the Science Notebook entries at https://www.edwardtufte.com/tufte (eastward.g., "Making amend inferences from statistical graphics"). Write a brief reflection on the graphical principles that are illustrated by this entry.

Problem 3 (Medium): Observe 2 graphs published in a newspaper or on the internet in the last 2 years.

Place a graphical brandish that you find compelling. What aspects of the display work well, and how do these relate to the principles established in this chapter? Include a screen shot of the display along with your solution.
Place a graphical brandish that you lot discover less than compelling. What aspects of the brandish don't work well? Are in that location ways that the display might be improved? Include a screen shot of the display along with your solution.

Problem iv (Medium): Find 2 scientific papers from the last two years in a peer-reviewed journal (Nature and Science are skilful choices).

Identify a graphical brandish that yous discover compelling. What aspects of the display work well, and how do these relate to the principles established in this chapter? Include a screen shot of the display along with your solution.
Identify a graphical display that yous find less than compelling. What aspects of the brandish don't work well? Are in that location ways that the display might be improved? Include a screen shot of the brandish along with your solution.

Problem 5 (Medium): Consider the two graphics related to The New York Times "Taxmageddon" article at http://www.nytimes.com/2012/04/fifteen/sunday-review/coming-shortly-taxmageddon.html. The first is "Whose Revenue enhancement Rates Rose or Fell" and the 2d is "Who Gains Nearly From Revenue enhancement Breaks."

Examine the two graphics carefully. Discuss what you think they convey. What story do the graphics tell?
Evaluate both graphics in terms of the taxonomy described in this chapter. Are the scales appropriate? Consequent? Conspicuously labeled? Do variable dimensions exceed data dimensions?
What, if anything, is misleading about these graphics?

Trouble 6 (Medium): Consider the information graphic http://tinyurl.com/nytimes-unplanned well-nigh birth control methods.

What quantity is being shown on the $y$-centrality of each plot?
Listing the variables displayed in the data graphic, along with the units and a few typical values for each.
List the visual cues used in the data graphic and explain how each visual cue is linked to each variable.
Examine the graphic carefully. Describe, in words, what information y'all remember the information graphic conveys. Do non just summarize the data—interpret the data in the context of the problem and tell us what it means. (Notation: information is meaningful to human beings—it is non the aforementioned affair as data.)

Supplementary exercises

Available at https://mdsr-book.github.io/mdsr2e/ch-vizI.html#datavizI-online-exercises

Trouble 1 (Easy): Consider the following information-driven prototype, available for purchase at NBA Playoff Rings: https://cdn.shopify.com/s/files/1/0144/6552/products/NBA-Basketball game-2013-_6_1024x1024.jpg

Identify the visual cues, coordinate organization, and scale(due south).
How many variables are depicted in the graphic? Explicitly link each variable to a visual cue that you lot listed above.
Critique this data graphic using the taxonomy described in this chapter.

Trouble 2 (Easy): 2022 ELECTION: Consider the following data graphic about results from the 2022 presidential ballot in Massachusetts.

What type of colour palette is used in this graphic?

Problem iii (Easy): Choose one of the data graphics listed at http://mdsr-book.github.io/exercises.html#exercise_23 and answer the following questions. Exist sure to betoken which graphical brandish yous picked.

Globe'southward Top x Best Selling Cigarette Brands, 2004-2007
~~GNPD Usage past Food Categories~~
UK University Rankings
Childhood Obesity in the Us
Relationship between ages and psychosocial maturity

Identify the visual cues, coordinate system, and calibration(s).
How many variables are depicted in the graphic? Explicitly link each variable to a visual cue that you listed higher up.
Critique this data graphic using the taxonomy described in this affiliate.

Problem iv (Medium): Respond the following questions for each of the following collections of data graphics listed at (http://mdsr-book.github.io/exercises.html#exercise_24).

What is a Information Scientist?
Charts that explain food in America

Briefly (i paragraph) critique the designer's choices. Would you accept made unlike choices? Why or why not?

Note: Each link contains a collection of many information graphics, and nosotros don't expect (or desire) you to write a total report on each individual graphic. But each collection shares some common stylistic elements. You lot should comment on a few things that you notice about the blueprint of the drove.

Trouble v (Medium): Consider one of the more complicated data graphics listed at (http://mdsr-book.github.io/exercises.html#exercise_25):

What story does the information graphic tell? What is the primary message that you accept away from it?
Tin can the data graphic be described in terms of the taxonomy presented in this chapter? If and so, list the visual cues, coordinate system, and scales(s) equally you did in Trouble 2(a). If not, draw the feature of this information graphic that lies outside of that taxonomy.
Critique and/or praise the visualization choices made past the designer. Do they work? Are they misleading? Idea-provoking? Vivid? Are in that location things that you would have done differently? Justify your response.

lemayshas1947.blogspot.com

Source: https://mdsr-book.github.io/mdsr2e/ch-vizI.html