forked from clauswilke/dataviz
```{r echo = FALSE, message = FALSE, warning = FALSE}
# run setup script
source("_common.R")
library(ggforce)
library(treemapify)
```
# Visualizing nested proportions {#nested-proportions}
In the preceding chapter, I discussed scenarios where a dataset is broken into pieces defined by one categorical variable, such as political party, company, or health status. It is not uncommon, however, that we want to drill down further and break down a dataset by multiple categorical variables at once. For example, in the case of parliamentary seats, we could be interested in the proportions of seats by party and by the gender of the representatives. Similarly, in the case of people's health status, we could ask how health status further breaks down by marital status. I refer to these scenarios as nested proportions, because each additional categorical variable that we add creates a finer subdivision of the data nested within the previous proportions. There are several suitable approaches to visualize such nested proportions, including mosaic plots, treemaps, and parallel sets.
## Nested proportions gone wrong
I will begin by demonstrating two flawed approaches to visualizing nested proportions. While these approaches may seem nonsensical to any experienced data scientist, I have seen them in the wild and therefore think they warrant discussion. Throughout this chapter, I will work with a dataset of 106 bridges in Pittsburgh. This dataset contains various pieces of information about the bridges, such as the material from which they are constructed (steel, iron, or wood) and the year when they were erected. Based on the year of erection, bridges are grouped into distinct categories, such as crafts bridges that were erected before 1870 and modern bridges that were erected after 1940.
Let's assume we want to visualize both the fraction of bridges made from steel, iron, or wood and the fraction that are crafts or modern. We might be tempted to do so by drawing a combined pie chart (Figure \@ref(fig:bridges-pie-wrong)). However, this visualization is not valid. All the slices in a pie chart must add up to 100%, and here the slices add up to 135%. We reach a total percentage in excess of 100% because we are double-counting bridges. Every bridge in the dataset is made of steel, iron, or wood, so these three slices of the pie already represent 100% of the bridges. Every crafts or modern bridge is also a steel, iron, or wood bridge, and hence is counted twice in the pie chart.
(ref:bridges-pie-wrong) Breakdown of bridges in Pittsburgh by construction material (steel, wood, iron) and by date of construction (crafts, before 1870, and modern, after 1940), shown as a pie chart. Numbers represent the percentages of bridges of a given type among all bridges. This figure is invalid, because the percentages add up to more than 100%. There is overlap between construction material and date of construction. For example, all modern bridges are made of steel, and the majority of crafts bridges are made of wood. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-pie-wrong, fig.width = 5, fig.asp = 0.7, fig.cap = '(ref:bridges-pie-wrong)'}
# crafts: before 1870
# emerging: 1870 -- 1889
# mature: 1890 -- 1939
# modern: after 1940
select(bridges, MATERIAL, ERECTED) %>%
table() %>%
reshape2::melt() %>%
rename(material = MATERIAL, erected = ERECTED, count = value) %>%
mutate(
material = case_when(
material == "IRON" ~ "iron",
material == "STEEL" ~ "steel",
material == "WOOD" ~ "wood"
),
erected = case_when(
erected == "CRAFTS" ~ "crafts",
erected == "EMERGING" ~ "emerging",
erected == "MATURE" ~ "mature",
erected == "MODERN" ~ "modern"
)
) %>%
group_by(erected) %>%
mutate(group_count = sum(count)) -> bridges_tidy
n_total <- sum(bridges_tidy$count)
bridges_erected <- filter(bridges_tidy, erected %in% c("crafts", "modern")) %>%
group_by(erected) %>%
summarize(
count = sum(count),
percent = round(100*count/n_total, 1)
) %>%
rename(type = erected)
bridges_material <- group_by(bridges_tidy, material) %>%
summarize(
count = sum(count),
percent = round(100*count/n_total, 1)
) %>%
rename(type = material)
bridges_material_erected <- rbind(bridges_material, bridges_erected) %>%
mutate(
type = factor(type, levels = c("steel", "wood", "iron", "modern", "crafts"))
) %>%
arrange(type)
bridges_pie <- bridges_material_erected %>%
mutate(
count_total = sum(count),
end_angle = 2*pi*cumsum(count)/count_total, # ending angle for each pie slice
start_angle = lag(end_angle, default = 0), # starting angle for each pie slice
mid_angle = 0.5*(start_angle + end_angle), # middle of each pie slice, for the text label
hjust = ifelse(mid_angle>pi, 1, 0),
vjust = ifelse(mid_angle<pi/2 | mid_angle>3*pi/2, 0, 1)
)
rpie <- 1
rlabel <- 1.05 * rpie
p_bridges_pie <- ggplot(bridges_pie) +
geom_arc_bar(
aes(
x0 = 0, y0 = 0, r0 = 0, r = rpie,
start = start_angle, end = end_angle, fill = type
),
color = "white", size = 0.5
) +
geom_text(
aes(
x = rlabel*sin(mid_angle),
y = rlabel*cos(mid_angle),
label = type,
hjust = hjust, vjust = vjust
),
family = dviz_font_family,
size = 14/.pt
) +
geom_text(
aes(
x = 0.6*sin(mid_angle),
y = 0.6*cos(mid_angle),
label = paste0(percent, "%")
),
family = dviz_font_family,
size = 12/.pt,
color = c("white", "white", "white", "black", "black")
) +
coord_fixed(clip = "off") +
scale_x_continuous(
limits = c(-1.5, 1.5), expand = c(0, 0), name = "", breaks = NULL, labels = NULL
) +
scale_y_continuous(
limits = c(-1.15, 1.15), expand = c(0, 0), name = "", breaks = NULL, labels = NULL
) +
scale_fill_manual(
values = c(iron = "#D55E00D0", wood = "#009E73D0", steel = "#0072B2D0",
crafts = "#F0E442D0", modern = "#56B4E9D0")
) +
theme_dviz_map() +
theme(legend.position = "none")
stamp_wrong(p_bridges_pie)
```
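The double-counting described above can be checked numerically. The following is a minimal sketch that uses made-up counts (an assumption for illustration, not the actual bridges data). It shows that the percentages for a single variable add up to 100%, whereas mixing slices from two variables pushes the total beyond 100%.

```r
# Hypothetical counts, for illustration only (not the real bridges data)
toy <- data.frame(
  material = c("wood", "wood", "iron", "iron", "steel", "steel", "steel"),
  era      = c("crafts", "emerging", "crafts", "emerging",
               "emerging", "mature", "modern"),
  count    = c(10, 4, 5, 3, 12, 40, 26)
)
n_total <- sum(toy$count)

# percentages by material: every case is counted exactly once
pct_material <- 100 * tapply(toy$count, toy$material, sum) / n_total

# percentages by era: the same cases, partitioned a second time
pct_era <- 100 * tapply(toy$count, toy$era, sum) / n_total

# slices of the flawed pie: all materials plus two of the eras
pct_combined <- c(pct_material, pct_era[c("crafts", "modern")])

sum(pct_material)  # a valid pie: the slices cover the data exactly once
sum(pct_combined)  # exceeds 100, because some cases are counted twice
```

A quick check of this kind, run before drawing any pie chart, would catch the problem early.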
Double-counting is not necessarily a problem if we choose a visualization that does not require the proportions to add to 100%. As discussed in the preceding chapter, side-by-side bars meet this criterion. We can show the various proportions of bridges as bars in a single plot, and this plot is not technically wrong (Figure \@ref(fig:bridges-bars-bad)). Nevertheless, I have labeled it as "bad", because it does not immediately show that there is overlap among some of the categories shown. A casual observer might conclude from Figure \@ref(fig:bridges-bars-bad) that there are five separate categories of bridges, and that, for example, modern bridges are neither made of steel nor of wood or iron.
(ref:bridges-bars-bad) Breakdown of bridges in Pittsburgh by construction material (steel, wood, iron) and by date of construction (crafts, before 1870, and modern, after 1940), shown as a bar plot. Unlike Figure \@ref(fig:bridges-pie-wrong), this visualization is not technically wrong, since it doesn't imply that the bar heights need to add up to 100%. However, it also does not clearly indicate the overlap among different groups, and therefore I have labeled it "bad". Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-bars-bad, fig.cap = '(ref:bridges-bars-bad)'}
p_bridges_bars <- ggplot(bridges_material_erected) +
  aes(type, percent, fill = type) +
  geom_col() +
  scale_y_continuous(
    limits = c(0, 75),
    expand = c(0, 0),
    labels = function(x) paste0(x, "%"),
    name = "proportion of bridges"
  ) +
  scale_x_discrete(name = NULL) +
  scale_fill_manual(
    values = c(iron = "#D55E00D0", wood = "#009E73D0", steel = "#0072B2D0",
               crafts = "#F0E442D0", modern = "#56B4E9D0")
  ) +
  coord_cartesian(clip = "off") +
  theme_dviz_hgrid() +
  theme(
    axis.line.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none",
    plot.margin = margin(3.5, 7, 3.5, 1.5)
  )
stamp_bad(p_bridges_bars)
```
## Mosaic plots and treemaps
Whenever we have categories that overlap, it is best to show clearly how they relate to each other. This can be done with a mosaic plot (Figure \@ref(fig:bridges-mosaic)). At first glance, a mosaic plot looks similar to a stacked bar plot (e.g., Figure \@ref(fig:marketshare-stacked)). However, unlike in a stacked bar plot, in a mosaic plot both the heights and the widths of individual shaded areas vary. Note that in Figure \@ref(fig:bridges-mosaic), we see two additional construction eras, *emerging* (from 1870 to 1889) and *mature* (1890 to 1939). In combination with crafts and modern, these construction eras cover all bridges in the dataset, as do the three building materials. This is a critical condition for a mosaic plot: Every categorical variable shown must cover all the observations in the dataset.
(ref:bridges-mosaic) Breakdown of bridges in Pittsburgh by construction material (steel, wood, iron) and by era of construction (crafts, emerging, mature, modern), shown as a mosaic plot. The width of each rectangle is proportional to the number of bridges constructed in that era, and the height is proportional to the fraction of bridges from that era that were constructed from that material. Numbers represent the counts of bridges within each category. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-mosaic, fig.cap = '(ref:bridges-mosaic)'}
select(bridges, MATERIAL, ERECTED) %>%
table() %>%
reshape2::melt() %>%
rename(material = MATERIAL, erected = ERECTED, count = value) %>%
mutate(
material = case_when(
material == "IRON" ~ "iron",
material == "STEEL" ~ "steel",
material == "WOOD" ~ "wood"
),
erected = case_when(
erected == "CRAFTS" ~ "crafts",
erected == "EMERGING" ~ "emerging",
erected == "MATURE" ~ "mature",
erected == "MODERN" ~ "modern"
)
) %>%
group_by(erected) %>%
mutate(group_count = sum(count)) -> bridges_tidy
labels_df <- group_by(bridges_tidy, erected) %>%
filter(count != 0) %>%
arrange(desc(material)) %>%
mutate(
y = (cumsum(count) - 0.5*count)/group_count,
y = ifelse(
erected == "mature" & material == "wood", NA, y
)
)
ggplot(bridges_tidy) +
aes(x = erected, y = count, width = group_count, fill = material) +
geom_bar(stat = "identity", position = "fill", colour = "white", size = 1) +
geom_text(
data = labels_df,
aes(y = y, label = count, color = material),
na.rm = TRUE,
size = 12/.pt,
family = dviz_font_family
) +
facet_grid(~erected, scales = "free_x", space = "free_x") +
scale_y_continuous(
name = NULL,
#breaks = NULL,
expand = c(0, 0),
breaks = filter(labels_df, erected == "crafts")$y,
labels = filter(labels_df, erected == "crafts")$material,
sec.axis = dup_axis(
breaks = filter(labels_df, erected == "modern")$y,
labels = filter(labels_df, erected == "modern")$material
)
) +
scale_x_discrete(
name = NULL
) +
scale_fill_manual(
values = c("#D55E00D0", "#0072B2D0", "#009E73D0"),
guide = "none"
) +
scale_color_manual(
values = c(iron = "white", wood = "white", steel = "white"),
guide = "none"
) +
coord_cartesian(clip = "off") +
theme_dviz_grid(rel_small = 1) +
theme(
line = element_blank(),
strip.text = element_blank(),
axis.ticks.length = unit(0, "pt"),
panel.spacing.x = unit(0, "pt")
)
```
To draw a mosaic plot, we begin by placing one categorical variable along the *x* axis (here, era of bridge construction) and subdivide the *x* axis by the relative proportions that make up the categories. We then place the other categorical variable along the *y* axis (here, building material) and, within each category along the *x* axis, subdivide the *y* axis by the relative proportions that make up the categories of the *y* variable. The result is a set of rectangles whose areas are proportional to the number of cases representing each possible combination of the two categorical variables.
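The two-step construction just described can be sketched in a few lines of base R. The counts below are hypothetical placeholders (not the real bridges data), and the variable names are my own. The point is only to show that the resulting rectangle areas equal the proportions of the individual combinations.

```r
# Hypothetical counts (rows: material, columns: era); not the real data
counts <- matrix(
  c(10,  4,  1,  0,
     5,  3,  3,  0,
     0, 12, 40, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    material = c("wood", "iron", "steel"),
    era = c("crafts", "emerging", "mature", "modern")
  )
)

# Step 1: subdivide the x axis by the share of each era
era_share <- colSums(counts) / sum(counts)
xmax <- cumsum(era_share)
xmin <- xmax - era_share

# Step 2: within each era, subdivide the y axis by the share of each material
within_era <- sweep(counts, 2, colSums(counts), "/")  # each column sums to 1
ymax <- apply(within_era, 2, cumsum)
ymin <- ymax - within_era

# one row per rectangle, in column-major order (material varies fastest)
rects <- data.frame(
  material = rep(rownames(counts), times = ncol(counts)),
  era      = rep(colnames(counts), each = nrow(counts)),
  xmin     = rep(unname(xmin), each = nrow(counts)),
  xmax     = rep(unname(xmax), each = nrow(counts)),
  ymin     = as.vector(ymin),
  ymax     = as.vector(ymax),
  count    = as.vector(counts)
)
rects$area <- (rects$xmax - rects$xmin) * (rects$ymax - rects$ymin)
```

Each rectangle's `area` equals `count / sum(counts)`, so combinations that do not occur in the data collapse to rectangles of zero height.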
The bridges dataset can also be visualized in a related but distinct format called a *treemap*. In a treemap, just as is the case in a mosaic plot, we take an enclosing rectangle and subdivide it into smaller rectangles whose areas represent the proportions. However, the method of placing the smaller rectangles into the larger one is different compared to the mosaic plot. In a treemap, we recursively nest rectangles inside each other. For example, in the case of the Pittsburgh bridges, we can first subdivide the total area into three parts representing the three building materials wood, iron, and steel. Then, we subdivide each of those areas further to represent the construction eras represented for each building material (Figure \@ref(fig:bridges-treemap)). In principle, we could keep nesting ever smaller subdivisions inside each other, though relatively quickly the result would become unwieldy or confusing.
(ref:bridges-treemap) Breakdown of bridges in Pittsburgh by construction material (steel, wood, iron) and by era of construction (crafts, emerging, mature, modern), shown as a treemap. The area of each rectangle is proportional to the number of bridges of that type. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-treemap, fig.asp = 3/4, fig.cap = '(ref:bridges-treemap)'}
filcols <- c("#D55E00D0", "#0072B2D0", "#009E73D0")
filcols <- c(vapply(filcols, function(x) c(lighten(x, .9), lighten(x, .6), lighten(x, .3), x), character(4)))
ggplot(bridges_tidy, aes(area = count, subgroup = material, fill = interaction(erected, material))) +
geom_treemap(color = "white", size = 0.5*.pt, alpha = NA) +
geom_treemap_subgroup_text(
family = dviz_font_family,
colour = "grey50",
place = "centre", alpha = 0.7,
grow = TRUE
) +
geom_treemap_subgroup_border(color = "white") +
geom_treemap_text(
aes(label = erected, color = interaction(erected, material)),
family = dviz_font_family,
place = "centre",
grow = FALSE
) +
scale_fill_manual(values = filcols) +
scale_color_manual(values = c(
crafts.iron = "black", crafts.steel = "black", crafts.wood = "black",
emerging.iron = "black", emerging.steel = "black", emerging.wood = "black",
mature.iron = "black", mature.steel = "black", mature.wood = "black",
modern.iron = "white", modern.steel = "white", modern.wood = "white")
) +
coord_cartesian(clip = "off") +
guides(colour = "none", fill = "none")
```
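The recursive nesting that defines a treemap can be illustrated with a small stand-alone function. The sketch below uses a simple "slice-and-dice" layout, which alternates between splitting along *x* and along *y* at each level. It is not the algorithm used to draw Figure \@ref(fig:bridges-treemap) (treemap packages typically use more elaborate layouts that try to keep rectangles close to square), and both the function names and the counts are hypothetical.

```r
# total count stored in a node: a single number (leaf) or a named list of nodes
node_total <- function(node) {
  if (is.list(node)) sum(vapply(node, node_total, numeric(1))) else node
}

# recursive slice-and-dice layout inside the rectangle [xmin, xmax] x [ymin, ymax]
layout_treemap <- function(node, xmin = 0, xmax = 1, ymin = 0, ymax = 1,
                           path = "", split_x = TRUE) {
  if (!is.list(node)) {
    # leaf: return its rectangle
    return(data.frame(path = path, xmin = xmin, xmax = xmax,
                      ymin = ymin, ymax = ymax))
  }
  totals <- vapply(node, node_total, numeric(1))
  share_end <- unname(cumsum(totals) / sum(totals))
  share_start <- c(0, head(share_end, -1))
  pieces <- lapply(seq_along(node), function(i) {
    name <- names(node)[i]
    child_path <- if (nzchar(path)) paste(path, name, sep = "/") else name
    if (split_x) {
      # split the current rectangle along x, then let the children split along y
      layout_treemap(
        node[[i]],
        xmin + share_start[i] * (xmax - xmin),
        xmin + share_end[i] * (xmax - xmin),
        ymin, ymax, child_path, split_x = FALSE
      )
    } else {
      layout_treemap(
        node[[i]],
        xmin, xmax,
        ymin + share_start[i] * (ymax - ymin),
        ymin + share_end[i] * (ymax - ymin),
        child_path, split_x = TRUE
      )
    }
  })
  do.call(rbind, pieces)
}

# hypothetical nested counts: material first, then era within material
bridges_tree <- list(
  wood  = list(crafts = 10, emerging = 4, mature = 1),
  iron  = list(crafts = 5, emerging = 3, mature = 3),
  steel = list(emerging = 12, mature = 40, modern = 22)
)
tiles <- layout_treemap(bridges_tree)
tiles$area <- (tiles$xmax - tiles$xmin) * (tiles$ymax - tiles$ymin)
```

Because the input is a nested list rather than a cross-tabulation, each material can contain a different set of eras, and further levels can be added simply by nesting more lists.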
While mosaic plots and treemaps are closely related, they have different points of emphasis and different application areas. Here, the mosaic plot (Figure \@ref(fig:bridges-mosaic)) emphasizes the temporal evolution in building-material use from the crafts era to the modern era, whereas the treemap (Figure \@ref(fig:bridges-treemap)) emphasizes the total number of steel, iron, and wood bridges.
More generally, mosaic plots assume that all of the proportions shown can be identified via combinations of two or more orthogonal categorical variables. For example, in Figure \@ref(fig:bridges-mosaic), every bridge can be described by a choice of building material (wood, iron, steel) and a choice of time period (crafts, emerging, mature, modern). Moreover, in principle every combination of these two variables is possible, even though in practice this need not be the case. (Here, there are no steel crafts bridges and no wood or iron modern bridges.) By contrast, such a requirement does not exist for treemaps. In fact, treemaps tend to work well when the proportions cannot meaningfully be described by combining multiple categorical variables. For example, we can separate the U.S. into four regions (West, Northeast, Midwest, and South) and each region into distinct states, but the states in one region have no relationship to the states in another region (Figure \@ref(fig:US-states-treemap)).
(ref:US-states-treemap) States in the U.S. visualized as a treemap. Each rectangle represents one state, and the area of each rectangle is proportional to the state's land surface area. The states are grouped into four regions, West, Northeast, Midwest, and South. The coloring is proportional to the number of inhabitants for each state, with darker colors representing larger numbers of inhabitants. Data source: 2010 U.S. Census
```{r US-states-treemap, fig.width = 8.5, fig.cap = '(ref:US-states-treemap)'}
population_df <- left_join(US_census, US_regions) %>%
group_by(region, division, state) %>%
summarize(
pop2000 = sum(pop2000, na.rm = TRUE),
pop2010 = sum(pop2010, na.rm = TRUE),
area = sum(area)
) %>%
ungroup() %>%
mutate(
state = factor(state, levels = state),
region = factor(region, levels = c("West", "South", "Midwest", "Northeast"))
)
## manually add colors
# hues: brown, green, blue, purple
hues <- c(50, 100, 250, 300)
# minimum and maximum state population (2010)
minpop <- min(population_df$pop2010)
maxpop <- max(population_df$pop2010)
# turn population into color
population_df_color <- population_df %>%
mutate(index = as.numeric(factor(region))) %>%
group_by(index) %>%
mutate(
value = (pop2010-minpop)/(maxpop-minpop),
fill = scales::gradient_n_pal(
colorspace::sequential_hcl(
6,
h = hues[index[1]],
c = c(45, 20),
l = c(30, 80),
power = .5
)
)(1-value)
)
ggplot(population_df_color, aes(area = area, subgroup = region, fill = fill)) +
geom_treemap(color = "white", size = 0.5*.pt, alpha = NA) +
geom_treemap_subgroup_text(
family = dviz_font_family,
colour = "white",
place = "centre", alpha = 0.7,
grow = TRUE
) +
geom_treemap_subgroup_border(color = "white") +
geom_treemap_text(
aes(label = state),
color = "black",
family = dviz_font_family,
place = "centre",
grow = FALSE
) +
scale_fill_identity() +
coord_cartesian(clip = "off") +
guides(colour = "none", fill = "none")
```
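Whether two categorical variables are crossed (suited to a mosaic plot) or strictly nested (suited to a treemap) can be checked with a cross-tabulation. The helper below is a hypothetical sketch with toy data: it reports whether every level of the inner variable occurs within exactly one level of the outer variable.

```r
# TRUE if every level of `inner` occurs within exactly one level of `outer`
is_nested <- function(inner, outer) {
  tab <- table(inner, outer)
  all(rowSums(tab > 0) == 1)
}

# toy example 1: states within regions (strictly nested)
states  <- c("Texas", "Florida", "Ohio", "Iowa", "Oregon", "Utah")
regions <- c("South", "South", "Midwest", "Midwest", "West", "West")
nested_states <- is_nested(states, regions)

# toy example 2: construction eras and materials (crossed)
material <- c("wood", "wood", "iron", "steel", "steel", "steel")
era      <- c("crafts", "emerging", "crafts", "emerging", "mature", "modern")
nested_eras <- is_nested(era, material)
```

In this toy data, the states are nested within regions, whereas the eras recur across several materials and are therefore crossed with them.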
Both mosaic plots and treemaps are commonly used and can be illuminating, but they have limitations similar to those of stacked bars (Table \@ref(tab:pros-cons-pie-bar)): A direct comparison among conditions can be difficult, because different rectangles do not necessarily share baselines that enable visual comparison. In mosaic plots or treemaps, this problem is exacerbated by the fact that the shapes of the different rectangles can vary. For example, there are the same number of iron bridges (three) among the emerging and the mature bridges, but this is difficult to discern in the mosaic plot (Figure \@ref(fig:bridges-mosaic)), because the two rectangles representing these two groups of three bridges have entirely different shapes. There isn't necessarily a solution to this problem---visualizing nested proportions can be tricky. Whenever possible, I recommend showing the actual counts or percentages on the plot, so readers can verify that their intuitive interpretation of the shaded areas is correct.
## Nested pies
At the beginning of this chapter, I visualized the bridges dataset with a flawed pie chart (Figure \@ref(fig:bridges-pie-wrong)), and I then argued that a mosaic plot or a treemap is more appropriate. However, both of these latter plot types are closely related to pie charts, since they all use area to represent data values. The primary difference is the type of coordinate system, polar in the case of a pie chart versus Cartesian in the case of a mosaic plot or treemap. This close relationship between these different plots raises the question of whether some variant of a pie chart can be used to visualize this dataset.
There are two possibilities. First, we can draw a pie chart composed of an inner and an outer circle (Figure \@ref(fig:bridges-nested-pie)). The inner circle shows the breakdown of the data by one variable (here, building material) and the outer circle shows the breakdown of each slice of the inner circle by the second variable (here, era of bridge construction). This visualization is reasonable but I have my reservations, and therefore I have labeled it "ugly". Most importantly, the two separate circles obscure the fact that each bridge in the dataset has both a building material and an era of bridge construction. In effect, in Figure \@ref(fig:bridges-nested-pie), we are still double-counting each bridge. If we add up all the numbers shown in the two circles we obtain 212, which is twice the number of bridges in the dataset.
(ref:bridges-nested-pie) Breakdown of bridges in Pittsburgh by construction material (steel, wood, iron, inner circle) and by era of construction (crafts, emerging, mature, modern, outer circle). Numbers represent the counts of bridges within each category. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-nested-pie, fig.width = 5, fig.asp = 0.7, fig.cap = '(ref:bridges-nested-pie)'}
bridges_arranged <-
ungroup(bridges_tidy) %>%
mutate(material = factor(material, levels = c("wood", "iron", "steel"))) %>%
arrange(material)
bridges_pie_outer <- bridges_arranged %>%
mutate(
count_total = sum(count),
end_angle = 2*pi*cumsum(count)/count_total, # ending angle for each pie slice
start_angle = lag(end_angle, default = 0), # starting angle for each pie slice
mid_angle = 0.5*(start_angle + end_angle), # middle of each pie slice, for the text label
hjust = ifelse(mid_angle>pi, 1, 0),
vjust = ifelse(mid_angle<pi/2 | mid_angle>3*pi/2, 0, 1),
type = erected,
label = paste0(erected, " (", material, ")"),
angle_off = ifelse(
label == "emerging (wood)", -0.0175/2,
ifelse(label == "mature (wood)", 2*0.0175, 0)
)
) %>%
#filter(erected %in% c("crafts", "modern"), count != 0)
filter(count != 0)
bridges_pie_inner <- bridges_arranged %>%
group_by(material) %>%
summarize(count = sum(count)) %>%
mutate(
count_total = sum(count),
end_angle = 2*pi*cumsum(count)/count_total, # ending angle for each pie slice
start_angle = lag(end_angle, default = 0), # starting angle for each pie slice
mid_angle = 0.5*(start_angle + end_angle), # middle of each pie slice, for the text label
hjust = ifelse(mid_angle>pi, 1, 0),
vjust = ifelse(mid_angle<pi/2 | mid_angle>3*pi/2, 0, 1),
type = material
)
rpie <- 1
rpie1 <- 0.6
rpie2 <- 1
rlabel <- 1.02 * rpie
bridges_nested_pie <- ggplot() +
geom_arc_bar(data = bridges_pie_outer,
aes(
x0 = 0, y0 = 0, r0 = rpie1, r = rpie2,
start = start_angle, end = end_angle, fill = type
),
color = "white", size = 0.5
) +
geom_arc_bar(data = bridges_pie_inner,
aes(
x0 = 0, y0 = 0, r0 = 0, r = rpie1,
start = start_angle, end = end_angle, fill = type
),
color = "white", size = 0.5
) +
geom_text(data = bridges_pie_outer,
aes(
x = rlabel*sin(mid_angle + angle_off),
y = rlabel*cos(mid_angle + angle_off),
label = label,
hjust = hjust, vjust = vjust
),
family = dviz_font_family,
size = 12/.pt
) +
geom_text(data = bridges_pie_outer,
aes(
x = 0.78*sin(mid_angle),
y = 0.78*cos(mid_angle),
label = count
),
family = dviz_font_family,
size = 10/.pt,
hjust = 0.5, vjust = 0.5
) +
geom_text(data = bridges_pie_inner,
aes(
x = 0.32*sin(mid_angle),
y = 0.32*cos(mid_angle),
label = count
),
family = dviz_font_family,
size = 10/.pt,
hjust = 0.5, vjust = 0.5
) +
coord_fixed(clip = "off") +
scale_x_continuous(
limits = c(-1.5, 1.8), expand = c(0, 0), name = "", breaks = NULL, labels = NULL
) +
scale_y_continuous(
limits = c(-1.15, 1.15), expand = c(0, 0), name = "", breaks = NULL, labels = NULL
) +
scale_fill_manual(
values = c(iron = "#D55E00D0", wood = "#009E73D0", steel = "#0072B2D0",
crafts = "#F0E442D0", modern = "#56B4E9D0", emerging = "#E69F00D0",
mature = "#CC79A7D0")
) +
theme_dviz_map() +
theme(legend.position = "none")
stamp_ugly(bridges_nested_pie)
```
Alternatively, we can first slice the pie into pieces representing the proportions according to one variable (e.g. material) and then subdivide these slices further according to the other variable (construction era) (Figure \@ref(fig:bridges-nested-pie2)). In this way, in effect we are making a normal pie chart with a large number of small pie slices. However, we can then use coloring to indicate the nested nature of the pie. In Figure \@ref(fig:bridges-nested-pie2), green colors represent wood bridges, orange colors represent iron bridges, and blue colors represent steel bridges. The darkness of each color represents the construction era, with darker colors corresponding to more recently constructed bridges. By using a nested color scale in this way, we can visualize the breakdown of the data both by the primary variable (construction material) and by the secondary variable (construction era).
(ref:bridges-nested-pie2) Breakdown of bridges in Pittsburgh by construction material (steel, wood, iron) and by era of construction (crafts, emerging, mature, modern). Numbers represent the counts of bridges within each category. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-nested-pie2, fig.width = 5, fig.asp = 0.7, fig.cap = '(ref:bridges-nested-pie2)'}
rpie <- 1
rpie1 <- 0
rpie2 <- 1
rlabel <- 1.02 * rpie
bridges_nested_pie2 <- ggplot() +
geom_arc_bar(data = bridges_pie_outer,
aes(
x0 = 0, y0 = 0, r0 = rpie1, r = rpie2,
start = start_angle, end = end_angle, fill = label
),
color = "white", size = 0.5
) +
geom_text(data = bridges_pie_outer,
aes(
x = rlabel*sin(mid_angle + angle_off),
y = rlabel*cos(mid_angle + angle_off),
label = label,
hjust = hjust, vjust = vjust
),
family = dviz_font_family,
size = 12/.pt
) +
geom_text(data = bridges_pie_outer,
aes(
x = 0.6*sin(mid_angle),
y = 0.6*cos(mid_angle),
label = count
),
color = c(rep("black", 8), "white"),
family = dviz_font_family,
size = 10/.pt,
hjust = 0.5, vjust = 0.5
) +
coord_fixed(clip = "off") +
scale_x_continuous(
limits = c(-1.5, 1.8), expand = c(0, 0), name = "", breaks = NULL, labels = NULL
) +
scale_y_continuous(
limits = c(-1.15, 1.15), expand = c(0, 0), name = "", breaks = NULL, labels = NULL
) +
scale_fill_manual(
values = c(
`crafts (wood)` = lighten("#009E73D0", .9),
`emerging (wood)` = lighten("#009E73D0", .6),
`mature (wood)` = lighten("#009E73D0", .3),
`crafts (iron)` = lighten("#D55E00D0", .9),
`emerging (iron)` = lighten("#D55E00D0", .6),
`mature (iron)` = lighten("#D55E00D0", .3),
`emerging (steel)` = lighten("#0072B2D0", .6),
`mature (steel)` = lighten("#0072B2D0", .3),
`modern (steel)` = "#0072B2D0"
)
) +
theme_dviz_map() +
theme(legend.position = "none")
bridges_nested_pie2
```
The pie chart of Figure \@ref(fig:bridges-nested-pie2) represents a reasonable visualization of the bridges dataset, but in a direct comparison to the equivalent treemap (Figure \@ref(fig:bridges-treemap)) I think the treemap is preferable. First, the rectangular shape of the treemap allows it to make better use of the available space. Figures \@ref(fig:bridges-treemap) and \@ref(fig:bridges-nested-pie2) are of exactly equal size, but in Figure \@ref(fig:bridges-nested-pie2) much of the figure is wasted as white space. Figure \@ref(fig:bridges-treemap), the treemap, has virtually no superfluous white space. This matters because it enables me to place the labels inside the shaded areas in the treemap. Inside labels always create a stronger visual unit with the data than outside labels and hence are preferred. Second, some of the pie slices in Figure \@ref(fig:bridges-nested-pie2) are very thin and thus hard to see. By contrast, every rectangle in Figure \@ref(fig:bridges-treemap) is of a reasonable size.
## Parallel sets
When we want to visualize proportions described by more than two categorical variables, mosaic plots, treemaps, and pie charts all can quickly become unwieldy. A viable alternative in this case can be a *parallel sets plot*. In a parallel sets plot, we show how the total dataset breaks down by each individual categorical variable, and then we draw shaded bands that show how the subgroups relate to each other. See Figure \@ref(fig:bridges-parallel-sets1) for an example. In this figure, I have broken down the bridges dataset by construction material (iron, steel, wood), length of each bridge (long, medium, short), the era during which each bridge was constructed (crafts, emerging, mature, modern), and the river each bridge spans (Allegheny, Monongahela, Ohio). The bands that connect the parallel sets are colored by construction material. This shows, for example, that wood bridges are mostly of medium length (with a few short bridges), were primarily erected during the crafts period (with a few bridges of medium length erected during the emerging and mature periods), and span primarily the Allegheny river (with a few crafts bridges spanning the Monongahela river). By contrast, iron bridges are all of medium length, were primarily erected during the crafts period, and span the Allegheny and Monongahela rivers in approximately equal proportions.
(ref:bridges-parallel-sets1) Breakdown of bridges in Pittsburgh by construction material, length, era of construction, and the river they span, shown as a parallel sets plot. The coloring of the bands highlights the construction material of the different bridges. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-parallel-sets1, fig.cap = '(ref:bridges-parallel-sets1)'}
# Tabulate bridge counts for every combination of the four variables,
# excluding bridges whose river code is "Y", and recode to readable labels.
select(bridges, MATERIAL, ERECTED, RIVER, LENGTH) %>%
  filter(RIVER != "Y") %>%
  table() %>%
  reshape2::melt() %>%
  rename(material = MATERIAL, erected = ERECTED, length = LENGTH, river = RIVER, count = value) %>%
  mutate(
    material = case_when(
      material == "IRON" ~ "iron",
      material == "STEEL" ~ "steel",
      material == "WOOD" ~ "wood"
    ),
    erected = case_when(
      erected == "CRAFTS" ~ "crafts",
      erected == "EMERGING" ~ "emerging",
      erected == "MATURE" ~ "mature",
      erected == "MODERN" ~ "modern"
    ),
    length = case_when(
      length == "LONG" ~ "long",
      length == "MEDIUM" ~ "medium",
      length == "SHORT" ~ "short"
    ),
    river = case_when(
      river == "A" ~ "Allegheny",
      river == "M" ~ "Monongahela",
      river == "O" ~ "Ohio"
    )
  ) -> data

# Reshape to the long format used by the parallel sets geoms and
# fix the left-to-right order of the axes.
data <- gather_set_data(data, 1:4)
data$x <- factor(data$x, levels = c("material", "length", "erected", "river"))

# Bands are colored by construction material.
ggplot(data, aes(x, id = id, split = y, value = count)) +
  geom_parallel_sets(aes(fill = material), alpha = 0.5, axis.width = 0.13) +
  geom_parallel_sets_axes(axis.width = 0.1, fill = "grey80", color = "grey80") +
  geom_parallel_sets_labels(
    color = 'black',
    family = dviz_font_family,
    size = 10/.pt,
    angle = 90
  ) +
  scale_x_discrete(
    name = NULL,
    expand = c(0, 0.2)
  ) +
  scale_y_continuous(breaks = NULL, expand = c(0, 0)) +
  scale_fill_manual(
    values = c(iron = "#D55E00D0", wood = "#009E73D0", steel = "#0072B2D0"),
    guide = "none"
  ) +
  theme_dviz_open() +
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    plot.margin = margin(14, 1.5, 2, 1.5)
  )
```
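To make the bookkeeping behind a parallel sets plot more concrete, here is a minimal sketch in base R. The data frame `toy` and all of its values are invented for illustration and are not the bridges data. The sketch assumes the usual reading of such a plot: the segments on each axis correspond to the counts of one variable on its own, and the bands linking two neighboring axes add up to the joint counts of those two variables.

```r
# Toy data, invented for illustration; these are not the bridges counts.
toy <- data.frame(
  material = c("wood", "wood", "wood", "iron", "steel", "steel"),
  length = c("medium", "medium", "short", "medium", "long", "medium"),
  stringsAsFactors = FALSE
)

# Segment heights on each axis: counts of one variable on its own.
material_counts <- table(toy$material)
length_counts <- table(toy$length)

# Combined width of the bands linking the two axes: joint counts.
joint_counts <- table(material = toy$material, length = toy$length)

# Long format with one row per observed combination, analogous to the
# table() and melt() steps used for the figure.
joint_long <- as.data.frame(joint_counts, responseName = "count")
joint_long <- joint_long[joint_long$count > 0, ]

print(material_counts)
print(joint_long)
```

Summing `joint_counts` over either variable recovers the counts for the other one, which is why the bands always fill each axis segment exactly.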
The same visualization looks quite different if we color by a different criterion, for example by river (Figure \@ref(fig:bridges-parallel-sets2)). This figure is visually busy, with many criss-crossing bands, but we do see that bridges of nearly every type span each river.
(ref:bridges-parallel-sets2) Breakdown of bridges in Pittsburgh by construction material, length, era of construction, and the river they span. This figure is similar to Figure \@ref(fig:bridges-parallel-sets1) but now the coloring of the bands highlights the river spanned by the different bridges. This figure is labeled "ugly" because the arrangement of the colored bands in the middle of the figure is very busy, and also because the bands need to be read from right to left. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-parallel-sets2, fig.cap = '(ref:bridges-parallel-sets2)'}
# Same axis order as in the previous figure, but bands are colored by river.
data$x <- factor(data$x, levels = c("material", "length", "erected", "river"))
p_ugly <- ggplot(data, aes(x, id = id, split = y, value = count)) +
  geom_parallel_sets(aes(fill = river), alpha = 0.5, axis.width = 0.13) +
  geom_parallel_sets_axes(axis.width = 0.1, fill = "grey80", color = "grey80") +
  geom_parallel_sets_labels(
    color = 'black',
    family = dviz_font_family,
    size = 10/.pt,
    angle = 90
  ) +
  scale_x_discrete(
    name = NULL,
    expand = c(0, 0.2)
  ) +
  scale_y_continuous(breaks = NULL, expand = c(0, 0)) +
  scale_fill_manual(
    values = c("#F0E442D0", "#56B4E9D0", "#CC79A7D0"),
    guide = "none"
  ) +
  theme_dviz_open() +
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    plot.margin = margin(14, 1.5, 2, 1.5)
  )

stamp_ugly(p_ugly)
```
I have labeled Figure \@ref(fig:bridges-parallel-sets2) as "ugly" because I think it is overly complex and confusing. First, since we are used to reading from left to right, I think the set that defines the coloring should appear all the way to the left, not on the right. This will make it easier to see where the coloring originates and how it flows through the dataset. Second, it is a good idea to change the order of the sets such that the number of criss-crossing bands is minimized. Following these principles, I arrive at Figure \@ref(fig:bridges-parallel-sets3), which I consider preferable to Figure \@ref(fig:bridges-parallel-sets2).
(ref:bridges-parallel-sets3) Breakdown of bridges in Pittsburgh by river, length, construction material, and era of construction. This figure differs from Figure \@ref(fig:bridges-parallel-sets2) only in the order of the parallel sets. However, the modified order results in a figure that is easier to read and less busy. Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning Repository [@UCI_repo_2017]
```{r bridges-parallel-sets3, fig.cap = '(ref:bridges-parallel-sets3)'}
# Place river, which defines the band colors, on the leftmost axis.
data$x <- factor(data$x, levels = c("river", "length", "material", "erected"))
ggplot(data, aes(x, id = id, split = y, value = count)) +
  geom_parallel_sets(aes(fill = river), alpha = 0.5, axis.width = 0.13) +
  geom_parallel_sets_axes(axis.width = 0.1, fill = "grey80", color = "grey80") +
  geom_parallel_sets_labels(
    color = 'black',
    family = dviz_font_family,
    size = 10/.pt,
    angle = 90
  ) +
  scale_x_discrete(
    name = NULL,
    expand = c(0, 0.2)
  ) +
  scale_y_continuous(breaks = NULL, expand = c(0, 0)) +
  scale_fill_manual(
    values = c("#F0E442D0", "#56B4E9D0", "#CC79A7D0"),
    guide = "none"
  ) +
  theme_dviz_open() +
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    plot.margin = margin(14, 1.5, 2, 1.5)
  )
```
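The order of the parallel sets is controlled entirely by the factor levels of the variable mapped to the x axis. The following minimal sketch illustrates this in base R. The vector `x` is a hypothetical stand-in for the corresponding column of the long-format data, and `axis_order` is just one possible arrangement that places river first.

```r
# Hypothetical stand-in for the column that names the axis of each row.
x <- c("material", "length", "erected", "river", "material", "river")

# Without explicit levels, factor() sorts the levels alphabetically.
default_order <- levels(factor(x))

# With explicit levels, we choose the left-to-right order ourselves.
axis_order <- c("river", "length", "material", "erected")
x_ordered <- factor(x, levels = axis_order)

print(default_order)
print(levels(x_ordered))

# The integer codes give the axis position of each row.
print(as.integer(x_ordered))
```

Because a discrete position scale arranges categories in level order, changing only the `levels` argument is enough to rearrange the sets without touching the underlying counts.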