```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(lubridate)
library(tidyr)
```

# Visualizing trends {#visualizing-trends}

When making scatter plots (Chapter \@ref(visualizing-associations)) or time series (Chapter \@ref(time-series)), we are often more interested in the overarching trend of the data than in the specific detail of where each individual data point lies. By drawing the trend on top of or instead of the actual data points, usually in the form of a straight or curved line, we can create a visualization that helps the reader immediately see key features of the data. There are two fundamental approaches to determining a trend: We can either smooth the data by some method, such as a moving average, or we can fit a curve with a defined functional form and then draw the fitted curve. Once we have identified a trend in a dataset, it may also be useful to look specifically at deviations from the trend or to separate the data into multiple components, including the underlying trend, any existing cyclical components, and episodic components or random noise.

## Smoothing

Let us consider a time series of the Dow Jones Industrial Average (Dow Jones for short), a stock-market index representing the price of 30 large, publicly owned U.S. companies. Specifically, we will look at the year 2009, right after the 2008 crash (Figure \@ref(fig:dow-jones)). During the tail end of the crash, in the first three months of the year 2009, the market lost over 2400 points (~27%). Then it slowly recovered for the remainder of the year. How can we visualize these longer-term trends while de-emphasizing the less important short-term fluctuations?
(ref:dow-jones) Daily closing values of the Dow Jones Industrial Average for the year 2009. Data source: Yahoo! Finance
```{r dow-jones, fig.width = 5*6/4.2, fig.asp = 0.5, fig.cap = '(ref:dow-jones)'}
dow_jones_industrial %>% filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10")) %>%
ggplot(aes(date, close)) +
geom_line(color = "grey20", size = 0.75) +
scale_x_date(limits = c(ymd("2008-12-31"), ymd("2010-01-10")), expand = c(0, 0)) +
xlab(NULL) + ylab("Dow Jones Industrial Average") +
theme_dviz_grid() +
theme(
plot.margin = margin(3, 12, 3, 1.5)
)
```
In statistical terms, we are looking for a way to *smooth* the stock-market time series. The act of smoothing produces a function that captures key patterns in the data while removing irrelevant minor detail or noise. Financial analysts usually smooth stock-market data by calculating moving averages. To generate a moving average, we take a time window, say the first 20 days in the time series, calculate the average price over these 20 days, then move the time window by one day, so it now spans the 2nd to 21st day, calculate the average over these 20 days, move the time window again, and so on. The result is a new time series consisting of a sequence of averaged prices.
To plot this sequence of moving averages, we need to decide which specific time point to associate with the average for each time window. Financial analysts often plot each average at the end of its respective time window. This choice results in curves that lag the original data (Figure \@ref(fig:dow-jones-moving-ave)a), with more severe lags corresponding to larger averaging time windows. Statisticians, on the other hand, plot the average at the center of the time window, which results in a curve that overlays perfectly on the original data (Figure \@ref(fig:dow-jones-moving-ave)b).
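
The moving averages shown in Figure \@ref(fig:dow-jones-moving-ave) are computed with a helper function `moving_ave()` defined in the book's setup script. As a rough illustration of the underlying computation, a minimal rolling-mean sketch in base R might look as follows; the function name, arguments, and implementation are mine, not the book's, and the `align` argument switches between placing each average at the end of its window (the financial-analyst convention) and at the center (the statistician convention).

```{r eval = FALSE}
# minimal rolling-mean sketch (illustrative; not the moving_ave() used below)
rolling_mean <- function(x, window, align = c("center", "right")) {
  align <- match.arg(align)
  n <- length(x)
  out <- rep(NA_real_, n)
  # average over each window of `window` consecutive values
  means <- vapply(
    seq_len(n - window + 1),
    function(i) mean(x[i:(i + window - 1)]),
    numeric(1)
  )
  if (align == "right") {
    # report each average at the end of its window; the curve lags the data
    out[window:n] <- means
  } else {
    # report each average at the center of its window; the curve overlays the data
    offset <- floor((window - 1) / 2)
    out[(1 + offset):(n - window + 1 + offset)] <- means
  }
  out
}

# usage sketch: 20-day centered moving average of the daily closing prices
# close_20d_ave <- rolling_mean(dow_jones_industrial$close, 20, align = "center")
```
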
(ref:dow-jones-moving-ave) Daily closing values of the Dow Jones Industrial Average for the year 2009, shown together with their 20-day, 50-day, and 100-day moving averages. (a) The moving averages are plotted at the end of the moving time windows. (b) The moving averages are plotted in the center of the moving time windows. Data source: Yahoo! Finance
```{r dow-jones-moving-ave, fig.width = 5*6/4.2, fig.asp = 1, fig.cap = '(ref:dow-jones-moving-ave)'}
p1 <- dow_jones_industrial %>% filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10")) %>%
mutate(
close_20d_ave = moving_ave(date, close, 20, center = FALSE),
close_50d_ave = moving_ave(date, close, 50, center = FALSE),
close_100d_ave = moving_ave(date, close, 100, center = FALSE)
) %>%
ggplot(aes(date, close)) +
geom_line(color = "grey20", size = .35) +
geom_line(aes(date, close_20d_ave, color = "20d"), size = 1, na.rm = TRUE) +
geom_line(aes(date, close_50d_ave, color = "50d"), size = 1, na.rm = TRUE) +
geom_line(aes(date, close_100d_ave, color = "100d"), size = 1, na.rm = TRUE) +
scale_color_manual(
values = c(
`20d` = "#009e73",
`50d` = "#d55e00",
`100d` = "#0072b2"
),
breaks = c("20d", "50d", "100d"),
labels = c("20-day average", "50-day average", "100-day average"),
name = NULL
) +
scale_x_date(
limits = c(ymd("2008-12-31"), ymd("2010-01-10")), expand = c(0, 0),
labels = NULL
) +
xlab(NULL) + ylab("Dow Jones Industrial Average") +
theme_dviz_grid() +
theme(
plot.margin = margin(3, 12, 3, 1.5),
legend.position = c(1, 0),
legend.justification = c(1, 0),
legend.box.background = element_rect(fill = "white", color = NA),
legend.box.margin = margin(6, 12, 0, 12),
axis.ticks.x = element_blank()
)
p2 <- dow_jones_industrial %>% filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10")) %>%
mutate(
close_20d_ave = moving_ave(date, close, 20, center = TRUE),
close_50d_ave = moving_ave(date, close, 50, center = TRUE),
close_100d_ave = moving_ave(date, close, 100, center = TRUE)
) %>%
ggplot(aes(date, close)) +
geom_line(color = "grey20", size = .35) +
geom_line(aes(date, close_20d_ave, color = "20d"), size = 1, na.rm = TRUE) +
geom_line(aes(date, close_50d_ave, color = "50d"), size = 1, na.rm = TRUE) +
geom_line(aes(date, close_100d_ave, color = "100d"), size = 1, na.rm = TRUE) +
scale_color_manual(
values = c(
`20d` = "#009e73",
`50d` = "#d55e00",
`100d` = "#0072b2"
),
breaks = c("20d", "50d", "100d"),
labels = c("20-day average", "50-day average", "100-day average"),
name = NULL,
guide = "none"
) +
scale_x_date(limits = c(ymd("2008-12-31"), ymd("2010-01-10")), expand = c(0, 0)) +
xlab(NULL) + ylab("Dow Jones Industrial Average") +
theme_dviz_grid() +
theme(
plot.margin = margin(3, 12, 3, 1.5)
)
plot_grid(p1, p2, ncol = 1, align = 'h', labels = "auto")
```
Regardless of whether we plot the smoothed time series with or without lag, we can see that the length of the time window over which we average sets the scale of the fluctuations that remain visible in the smoothed curve. The 20-day moving average only removes small, short-term spikes but otherwise follows the daily data closely. The 100-day moving average, on the other hand, removes even fairly substantial drops or spikes that play out over a time span of multiple weeks. For example, the massive drop to below 7000 points in the first quarter of 2009 is not visible in the 100-day moving average, which replaces it with a gentle curve that doesn't dip much below 8000 points (Figure \@ref(fig:dow-jones-moving-ave)). Similarly, the drop around July 2009 is completely invisible in the 100-day moving average.
The moving average is the most simplistic approach to smoothing, and it has some obvious limitations. First, it results in a smoothed curve that is shorter than the original curve (Figure \@ref(fig:dow-jones-moving-ave)). Parts are missing at either the beginning or the end or both. And the more the time series is smoothed (i.e., the larger the averaging window), the shorter the smoothed curve. Second, even with a large averaging window, a moving average is not necessarily that smooth. It may exhibit small bumps and wiggles even though larger-scale smoothing has been achieved (Figure \@ref(fig:dow-jones-moving-ave)). These wiggles are caused by individual data points that enter or exit the averaging window. Since all data points in the window are weighted equally, individual data points at the window boundaries can have visible impact on the average.
Statisticians have developed numerous approaches to smoothing that alleviate the downsides of moving averages. These approaches are much more complex and computationally costly, but they are readily available in modern statistical computing environments. One widely used method is LOESS (locally estimated scatterplot smoothing, @Cleveland1979), which fits low-degree polynomials to subsets of the data. Importantly, the points in the center of each subset are weighted more heavily than points at the boundaries, and this weighting scheme yields a much smoother result than we get from a simple moving average (Figure \@ref(fig:dow-jones-loess)). The LOESS curve shown here looks similar to the 100-day average, but this similarity should not be overinterpreted. The smoothness of a LOESS curve can be tuned by adjusting a parameter, and different parameter choices would have produced LOESS curves looking more like the 20-day or 50-day average.
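
In ggplot2, LOESS smoothing is available through `geom_smooth()` with `method = "loess"`, and its smoothness is controlled by the `span` argument. The following sketch illustrates the point about parameter choices made above; the two span values are illustrative and are not the settings used for the book's figures.

```{r eval = FALSE}
# sketch: the LOESS span determines how much short-term detail is smoothed away
dow_jones_industrial %>%
  filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10")) %>%
  ggplot(aes(date, close)) +
  geom_line(color = "grey20", size = 0.35) +
  # small span: wiggly curve, roughly comparable to a short averaging window
  geom_smooth(method = "loess", formula = y ~ x, span = 0.1, se = FALSE, color = "#009e73") +
  # large span: smooth curve, roughly comparable to a long averaging window
  geom_smooth(method = "loess", formula = y ~ x, span = 0.75, se = FALSE, color = "#0072b2")
```
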
(ref:dow-jones-loess) Comparison of LOESS fit to 100-day moving average for the Dow Jones data of Figure \@ref(fig:dow-jones-moving-ave). The overall trend shown by the LOESS smooth is nearly identical to the 100-day moving average, but the LOESS curve is much smoother and it extends to the entire range of the data. Data source: Yahoo! Finance
```{r dow-jones-loess, fig.width = 5*6/4.2, fig.asp = 0.5, fig.cap = '(ref:dow-jones-loess)'}
# LOESS (locally estimated scatterplot smoothing)
dow_jones_industrial %>% filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10")) %>%
mutate(
close_100d_ave = moving_ave(date, close, 100)
) %>%
ggplot(aes(date, close)) +
geom_line(color = "grey20", size = .35) +
geom_line(aes(date, close_100d_ave, color = "100d"), size = 1, na.rm = TRUE) +
geom_smooth(aes(color = "smooth"), size = 1, na.rm = TRUE, se = FALSE) +
scale_color_manual(
values = c(
`100d` = "#d55e00",
smooth = "#0072b2"
),
breaks = c("smooth", "100d"),
labels = c("LOESS smoother", "100-day average"),
name = NULL
) +
scale_x_date(limits = c(ymd("2008-12-31"), ymd("2010-01-10")), expand = c(0, 0)) +
xlab(NULL) + ylab("Dow Jones Industrial Average") +
theme_dviz_grid() +
theme(
legend.position = c(1, 0.48),
legend.justification = c(1, 0.5),
legend.box.background = element_rect(fill = "white", color = NA),
legend.box.margin = margin(0, 12, 6, 12),
plot.margin = margin(3, 12, 3, 1.5)
)
```
Importantly, LOESS is not limited to time series. It can be applied to arbitrary scatter plots, as is apparent from its name, *locally estimated scatterplot smoothing*. For example, we can use LOESS to look for trends in the relationship between a car's fuel-tank capacity and its price (Figure \@ref(fig:tank-capacity-loess)). The LOESS line shows that tank capacity grows approximately linearly with price for cheap cars (below \$20,000) but it levels off for more expensive cars. Above a price of approximately \$20,000, buying a more expensive car will not get you one with a larger fuel tank.
(ref:tank-capacity-loess) Fuel-tank capacity versus price of 93 cars released for the 1993 model year. Each dot corresponds to one car. The solid line represents a LOESS smooth of the data. We see that fuel-tank capacity increases approximately linearly with price, up to a price of approximately $20,000, and then it levels off. Data source: Robin H. Lock, St. Lawrence University
```{r tank-capacity-loess, fig.width = 5, fig.asp = 0.75, fig.cap='(ref:tank-capacity-loess)'}
cars93 <- MASS::Cars93
ggplot(cars93, aes(x = Price, y = Fuel.tank.capacity)) +
geom_point(color = "grey60") +
geom_smooth(se = FALSE, method = "loess", formula = y ~ x, color = "#0072B2") +
scale_x_continuous(
name = "price (USD)",
breaks = c(20, 40, 60),
labels = c("$20,000", "$40,000", "$60,000")
) +
scale_y_continuous(name = "fuel-tank capacity\n(US gallons)") +
theme_dviz_grid()
```
LOESS is a very popular smoothing approach because it tends to produce results that look right to the human eye. However, it requires the fitting of many separate regression models. This makes it slow for large datasets, even on modern computing equipment.
As a faster alternative to LOESS, we can use spline models. A spline is a piecewise polynomial function that is highly flexible yet always looks smooth. When working with splines, we will encounter the term *knot.* The knots in a spline are the endpoints of the individual spline segments. If we fit a spline with *k* segments, we need to specify *k* + 1 knots. While spline fitting is computationally efficient, in particular if the number of knots is not too large, splines have their own downsides. Most importantly, there is a bewildering array of different types of splines, including cubic splines, B-splines, thin-plate splines, Gaussian process splines, and many others, and which one to pick may not be obvious. The specific choice of the type of spline and number of knots used can result in widely different smoothing functions for the same data (Figure \@ref(fig:tank-capacity-smoothers)).
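
The spline-based panels of Figure \@ref(fig:tank-capacity-smoothers) are drawn with `geom_smooth()`, but the same models can also be fit directly with the mgcv package, which makes it easier to inspect and compare them. The sketch below mirrors two of the panels; it is illustrative rather than the code used to produce the figure.

```{r eval = FALSE}
library(mgcv)  # provides gam() and the spline bases used here

cars93 <- MASS::Cars93

# cubic regression spline with 5 knots (compare panel b of the figure)
fit_cr <- gam(Fuel.tank.capacity ~ s(Price, k = 5, bs = "cr"), data = cars93)

# thin-plate regression spline with 3 knots (compare panel c of the figure)
fit_tp <- gam(Fuel.tank.capacity ~ s(Price, k = 3), data = cars93)

# evaluate both fits on a grid of prices to compare the smoothing functions
price_grid <- data.frame(Price = seq(7, 62, by = 1))
pred_cr <- predict(fit_cr, newdata = price_grid)
pred_tp <- predict(fit_tp, newdata = price_grid)
```
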
(ref:tank-capacity-smoothers) Different smoothing models display widely different behaviors, in particular near the boundaries of the data. (a) LOESS smoother, as in Figure \@ref(fig:tank-capacity-loess). (b) Cubic regression splines with 5 knots. (c) Thin-plate regression spline with 3 knots. (d) Gaussian process spline with 6 knots. Data source: Robin H. Lock, St. Lawrence University
```{r tank-capacity-smoothers, fig.width = 5.5*6/4.2, fig.asp = 0.75, fig.cap='(ref:tank-capacity-smoothers)'}
cars_base <- ggplot(cars93, aes(x = Price, y = Fuel.tank.capacity)) + geom_point(color = "grey60") +
scale_x_continuous(
name = "price (USD)",
breaks = c(20, 40, 60),
labels = c("$20,000", "$40,000", "$60,000")
) +
scale_y_continuous(name = "fuel-tank capacity\n(US gallons)") +
theme_dviz_grid(12)
p1 <- cars_base + geom_smooth(se = FALSE, method = "loess", formula = y ~ x, color = "#0072B2")
p2 <- cars_base + geom_smooth(se = FALSE, method = "gam", formula = y ~ s(x, k = 5, bs = 'cr'), color = "#0072B2")
p3 <- cars_base + geom_smooth(se = FALSE, method = "gam", formula = y ~ s(x, k = 3), color = "#0072B2")
p4 <- cars_base + geom_smooth(se = FALSE, method = "gam", formula = y ~ s(x, k = 6, bs = 'gp'), color = "#0072B2")
plot_grid(
p1, NULL, p2,
NULL, NULL, NULL,
p3, NULL, p4,
align = 'hv',
labels = c("a", "", "b", "", "", "", "c", "", "d"),
rel_widths = c(1, .02, 1),
rel_heights = c(1, .02, 1)
)
```
Most data visualization software will provide smoothing features, likely implemented as either a type of local regression (such as LOESS) or a type of spline. The smoothing method may be referred to as a GAM, a generalized additive model, which is a superset of all these types of smoothers. It is important to be aware that the output of the smoothing feature is highly dependent on the specific GAM model that is fit. Unless you try out a number of different choices, you may never realize to what extent the results you see depend on the specific default choices made by your statistical software.
```{block type='rmdtip', echo=TRUE}
Be careful when interpreting the results from a smoothing function. The same dataset can be smoothed in many different ways.
```
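
As a concrete example of such default choices, `geom_smooth()` in ggplot2 selects its smoothing method based on the size of the dataset when no method is specified (LOESS for smaller datasets, a GAM for larger ones), so the same plotting code can produce different smoothers for different data. Spelling out the method, formula, and tuning parameters, as in this sketch, removes that ambiguity; the particular settings shown are illustrative.

```{r eval = FALSE}
cars93 <- MASS::Cars93

# relying on defaults: the smoothing model depends on the number of observations
ggplot(cars93, aes(Price, Fuel.tank.capacity)) +
  geom_point() +
  geom_smooth(se = FALSE)

# specifying the model explicitly: the result no longer depends on hidden defaults
ggplot(cars93, aes(Price, Fuel.tank.capacity)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.75, se = FALSE)
```
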

## Showing trends with a defined functional form

As we can see in Figure \@ref(fig:tank-capacity-smoothers), the behavior of general-purpose smoothers can be somewhat unpredictable for any given dataset. These smoothers also do not provide parameter estimates that have a meaningful interpretation. Therefore, whenever possible, it is preferable to fit a curve with a specific functional form that is appropriate for the data and that uses parameters with clear meaning.
For the fuel-tank data, we need a curve that initially rises linearly but then levels off at a constant value. The function $y = A - B \exp(-mx)$ may fit that bill. Here, $A$, $B$, and $m$ are the constants we adjust to fit the curve to the data. The function is approximately linear for small $x$ (since $\exp(-mx) \approx 1 - mx$ there), with $y \approx A - B + B m x$; it approaches the constant value $y \approx A$ for large $x$; and it is strictly increasing for all values of $x$. Figure \@ref(fig:tank-capacity-model) shows that this equation fits the data at least as well as any of the smoothers we considered previously (Figure \@ref(fig:tank-capacity-smoothers)).
(ref:tank-capacity-model) Fuel-tank data represented with an explicit analytical model. The solid line corresponds to a least-squares fit of the formula $y = A - B \exp(-mx)$ to the data. Fitted parameters are $A = 19.6$, $B = 29.2$, $m = 0.00015$. Data source: Robin H. Lock, St. Lawrence University
```{r tank-capacity-model, fig.width = 5, fig.asp = 0.75, fig.cap = '(ref:tank-capacity-model)'}
# first model
fit.out <- nls(
Fuel.tank.capacity ~ a*Price/(Price + b) + c,
data = cars93,
start = c(a = -45, b = -1, c = 70)
)
# second model
fit.out <- nls(
Fuel.tank.capacity ~ A1 - A0*exp(-m*Price),
data = cars93,
start = c(A0 = 29.249, A1 = 19.621, m = 0.149),
control = nls.control(maxiter = 1000, warnOnly = TRUE)
)
fit.df <- data.frame(
Price = 7:62,
Fuel.tank.capacity = predict(fit.out, data.frame(Price = 7:62))
)
ggplot(cars93, aes(x = Price, y = Fuel.tank.capacity)) +
geom_point(color = "grey60") +
geom_line(data = fit.df, size = 1, color = "#0072B2") +
#stat_function(fun = function(x) -24.22+1.38*x + 22) +
scale_x_continuous(
name = "price (USD)",
breaks = c(20, 40, 60),
labels = c("$20,000", "$40,000", "$60,000")
) +
scale_y_continuous(name = "fuel-tank capacity\n(US gallons)") +
theme_dviz_grid()
```
A functional form that is applicable in many different contexts is the simple straight line, $y = A + mx$. Approximately linear relationships between two variables are surprisingly common in real-world datasets. For example, in Chapter \@ref(visualizing-associations), I discussed the relationship between head length and body mass in blue jays. This relationship is approximately linear, for both female and male birds, and drawing linear trend lines on top of the points in a scatter plot helps the reader perceive the trends (Figure \@ref(fig:blue-jays-scatter-line)).
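
The trend lines in Figure \@ref(fig:blue-jays-scatter-line) are drawn with `geom_smooth(method = "lm")`, but the same linear fits can be computed directly with `lm()`, which exposes the estimated intercepts and slopes as numbers with a clear interpretation. A minimal sketch, assuming the `blue_jays` data frame with the `Mass`, `Head`, and `KnownSex` columns used in the plotting code below:

```{r eval = FALSE}
# separate straight-line fits for female and male birds;
# the slope estimates how much head length changes per gram of body mass
fit_f <- lm(Head ~ Mass, data = filter(blue_jays, KnownSex == "F"))
fit_m <- lm(Head ~ Mass, data = filter(blue_jays, KnownSex == "M"))

coef(fit_f)  # intercept and slope for female birds
coef(fit_m)  # intercept and slope for male birds
```
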
(ref:blue-jays-scatter-line) Head length versus body mass for 123 blue jays. The birds' sex is indicated by color. This figure is equivalent to Figure \@ref(fig:blue-jays-scatter-sex), except that now we have drawn linear trend lines on top of the individual data points. Data source: Keith Tarvin, Oberlin College
```{r blue-jays-scatter-line, fig.width = 5, fig.asp = 3.2/4, fig.cap='(ref:blue-jays-scatter-line)'}
ggplot(blue_jays, aes(Mass, Head, color = KnownSex, fill = KnownSex)) +
geom_point(pch = 21, color = "white", size = 2.5) +
geom_smooth(method = "lm", size = 0.75, se = FALSE, fullrange = TRUE) +
scale_x_continuous(name = "body mass (g)") +
scale_y_continuous(name = "head length (mm)") +
scale_fill_manual(
values = c(F = "#D55E00C0", M = "#0072B2C0"),
breaks = c("F", "M"),
labels = c("female birds ", "male birds"),
name = NULL,
guide = guide_legend(
direction = "horizontal",
override.aes = list(size = 3, linetype = 0)
)
) +
scale_color_manual(
values = c(F = "#D55E00", M = "#0072B2"),
breaks = c("F", "M"),
labels = c("female birds ", "male birds"),
name = NULL
) +
theme_dviz_grid() +
theme(
#legend.position = c(1, 0.01),
#legend.justification = c(1, 0),
legend.position = "top",
legend.justification = "right",
legend.box.spacing = unit(3.5, "pt"), # distance between legend and plot
legend.text = element_text(vjust = 0.6),
legend.spacing.x = unit(2, "pt"),
legend.background = element_rect(fill = "white", color = NA),
legend.key.width = unit(10, "pt")
)
```
When the data display a non-linear relationship, we need to guess what an appropriate functional form might be. In this case, we can assess the accuracy of our guess by transforming the axes in such a way that a linear relationship emerges. To demonstrate this principle, let's return to the monthly submissions to the preprint server bioRxiv, discussed in Chapter \@ref(visualizing-associations). If the increase in submissions in each month is proportional to the number of submissions in the previous month, i.e., if submissions grow by a fixed percentage each month, then the resulting curve is exponential. This assumption seems to be met for the bioRxiv data, because a curve with exponential form, $y = A\exp(mx)$, fits the bioRxiv submission data well (Figure \@ref(fig:biorxiv-expfit)).
(ref:biorxiv-expfit) Monthly submissions to the preprint server bioRxiv. The solid blue line represents the actual monthly preprint counts and the dashed black line represents an exponential fit to the data, $y = 60\exp[0.77(x - 2014)]$. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r biorxiv-expfit, fig.cap = '(ref:biorxiv-expfit)'}
preprint_growth %>% filter(archive == "bioRxiv") %>%
filter(count > 0) %>%
mutate(date_dec = decimal_date(date)) -> biorxiv_growth
expfit.out <- nls(
count ~ a*exp(b*(date_dec-2014)),
data = biorxiv_growth,
start = c(a = 60.004, b = .773)
)
linfit.out <- nls(
log(count) ~ log(a) + b*(date_dec-2014),
data = biorxiv_growth,
start = c(a = 42.576, b = .878)
)
date_seq = seq(min(biorxiv_growth$date_dec), max(biorxiv_growth$date_dec), by = 0.1)
expfit.df <- data.frame(
date_dec = date_seq,
count = predict(expfit.out, data.frame(date_dec = date_seq))
)
linfit.df <- data.frame(
date_dec = date_seq,
count = exp(predict(linfit.out, data.frame(date_dec = date_seq)))
)
ggplot(biorxiv_growth, aes(date_dec, count)) +
geom_line(data = expfit.df, aes(color = "expfit"), size = .5, linetype = 2) +
geom_point(aes(fill = "expfit"), shape = NA, na.rm = TRUE) + # dummy for legend
geom_line(aes(color = "data"), size = .5) +
geom_point(aes(fill = "data"), color = "white", shape = 21, size = 2) +
scale_y_continuous(
limits = c(0, 1550),
breaks = c(0, 500, 1000, 1500),
expand = c(0, 0),
name = "preprints / month"
) +
scale_x_continuous(name = NULL) +
scale_color_manual(
name = NULL,
values = c(data = "#0072B2", expfit = "black"),
breaks = c("data", "expfit"),
labels = c("actual counts", "exponential fit"),
guide = guide_legend(
override.aes = list(
color = c("white", "black"),
shape = c(21, NA),
size = c(2, 0.5),
linetype = c(0, 2)
)
)
) +
scale_fill_manual(
name = NULL,
values = c(data = "#0072B2", expfit = "black"),
breaks = c("data", "expfit"),
labels = c("actual counts", "exponential fit")
) +
theme_dviz_open() +
theme(
legend.position = c(.05, 1),
legend.justification = c(0, 1),
legend.spacing.x = unit(3, "pt"),
legend.title = element_blank(),
plot.margin = margin(7, 7, 3, 1.5)
)
```
If the original curve is exponential, $y = A\exp(mx)$, then a log-transformation of the *y* values will turn it into a linear relationship, $\log(y) = \log(A) + mx$. Therefore, plotting the data with log-transformed *y* values (or equivalently, with a logarithmic *y* axis) and looking for a linear relationship is a good way of determining whether a dataset exhibits exponential growth. For the bioRxiv submission numbers, we indeed obtain a linear relationship when using a logarithmic *y* axis (Figure \@ref(fig:biorxiv-logscale)).
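
The most direct way to obtain such a fit is ordinary least squares on the log-transformed counts; exponentiating the intercept then recovers $A$, and the slope is $m$. A minimal sketch using the `biorxiv_growth` data frame prepared in the earlier code chunk:

```{r eval = FALSE}
# straight-line fit on the log scale: log(count) = log(A) + m * (year - 2014)
logfit <- lm(log(count) ~ I(date_dec - 2014), data = biorxiv_growth)

A <- exp(coef(logfit)[1])  # back-transformed intercept, A in y = A * exp(m * x)
m <- coef(logfit)[2]       # growth rate per year

# predicted monthly counts on the original scale
pred_dates <- data.frame(date_dec = seq(2014, 2018, by = 0.1))
pred_counts <- exp(predict(logfit, newdata = pred_dates))
```
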
(ref:biorxiv-logscale) Monthly submissions to the preprint server bioRxiv, shown on a log scale. The solid blue line represents the actual monthly preprint counts, the dashed black line represents the exponential fit from Figure \@ref(fig:biorxiv-expfit), and the solid black line represents a linear fit to log-transformed data, corresponding to $y = 43\exp[0.88(x - 2014)]$. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r biorxiv-logscale, fig.cap = '(ref:biorxiv-logscale)'}
ggplot(biorxiv_growth, aes(date_dec, count)) +
geom_line(data = expfit.df, aes(color = "expfit"), size = .5, linetype = 2) +
geom_point(aes(fill = "expfit"), shape = NA, na.rm = TRUE) + # dummy for legend
geom_line(data = linfit.df, aes(color = "linfit"), size = .5) +
geom_point(aes(fill = "linfit"), shape = NA, na.rm = TRUE) + # dummy for legend
geom_line(aes(color = "data"), size = .5) +
geom_point(aes(fill = "data"), color = "white", shape = 21, size = 2) +
scale_y_log10(
limits = c(30, 1670),
breaks = c(10*(3:9), 100*(1:9), 1000*(1:2)),
labels = c(
"", "", "50", "", "", "", "", "100", "", "", "", "500",
"", "", "", "", "1000", ""
),
expand = c(0, 0),
name = "preprints / month"
) +
scale_x_continuous(name = NULL) +
scale_color_manual(
name = NULL,
values = c(data = "#0072B2", expfit = "black", linfit = "black"),
breaks = c("data", "expfit", "linfit"),
labels = c("actual counts", "exponential fit", "linear fit, log-transformed data"),
guide = guide_legend(
override.aes = list(
color = c("white", "black", "black"),
shape = c(21, NA, NA),
size = c(2, 0.5, 0.5),
linetype = c(0, 2, 1)
)
)
) +
scale_fill_manual(
name = NULL,
values = c(data = "#0072B2", expfit = "black", linfit = "black"),
breaks = c("data", "expfit", "linfit"),
labels = c("actual counts", "exponential fit", "linear fit, log-transformed data")
) +
theme_dviz_open() +
theme(
legend.position = c(.05, 1),
legend.justification = c(0, 1),
legend.spacing.x = unit(3, "pt"),
legend.title = element_blank(),
plot.margin = margin(7, 7, 3, 1.5)
)
```
In Figure \@ref(fig:biorxiv-logscale), in addition to the actual submission counts, I am also showing the exponential fit from Figure \@ref(fig:biorxiv-expfit) and a linear fit to the log-transformed data. These two fits are similar but not identical. In particular, the slope of the dashed line seems somewhat off. The line systematically falls above the individual data points for half the time series. This is a common problem with exponential fits: The squared deviations between the fitted curve and the data points are so much larger for the largest data values than for the smallest ones that the smallest data values contribute little to the overall sum of squares the fit minimizes. As a result, the fitted line systematically overshoots or undershoots the smallest data values. For this reason, I generally advise avoiding exponential fits and instead using linear fits on log-transformed data.
```{block type='rmdtip', echo=TRUE}
It is usually better to fit a straight line to transformed data than to fit a nonlinear curve to untransformed data.
```
A plot such as Figure \@ref(fig:biorxiv-logscale) is commonly referred to as *log--linear*, since the *y* axis is logarithmic and the *x* axis is linear. Other plots we may encounter include *log--log*, where both the *y* and the *x* axis are logarithmic, and *linear--log*, where *y* is linear and *x* is logarithmic. In a log--log plot, power laws of the form $y \sim x^\alpha$ appear as straight lines (see Figure \@ref(fig:word-counts-tail-log-log) for an example), and in a linear--log plot, logarithmic relationships of the form $y \sim \log(x)$ appear as straight lines. Other functional forms can be turned into linear relationships with more specialized coordinate transformations, but these three (log--linear, log--log, linear--log) cover a wide range of real-world applications.
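
In ggplot2, these plot types only require swapping in logarithmic scales. The following sketch uses simulated power-law data (the data and parameter values are made up for illustration); because scale transformations are applied before statistical transformations, the linear fit is computed on the log-transformed values and appears as a straight line in the log--log plot.

```{r eval = FALSE}
# simulated power law, y ~ x^1.5, with multiplicative noise
power_df <- data.frame(x = 1:100)
power_df$y <- 2 * power_df$x^1.5 * exp(rnorm(nrow(power_df), sd = 0.1))

# on a log-log plot, the power law appears as a straight line
ggplot(power_df, aes(x, y)) +
  geom_point(color = "grey60") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#0072B2") +
  scale_x_log10() +
  scale_y_log10()
```
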

## Detrending and time-series decomposition

For any time series with a prominent long-term trend, it may be useful to remove this trend to specifically highlight any notable deviations. This technique is called *detrending*, and I will demonstrate it here with house prices. In the U.S., the mortgage lender Freddie Mac publishes a monthly index, called the *Freddie Mac House Price Index,* that tracks the change in house prices over time. The index attempts to capture the state of the entire house market in a given region, such that an increase in the index by, for example, 10% can be interpreted as an average house price increase of 10% in the respective market. The index is arbitrarily set to a value of 100 in December 2000.
Over long periods of time, house prices tend to display consistent annual growth, approximately in line with inflation. However, overlaid on top of this trend are housing bubbles that lead to severe boom and bust cycles. Figure \@ref(fig:hpi-trends) shows the actual house price index and its long-term trend for four select U.S. states. We see that between 1980 and 2017, California underwent two bubbles, one in 1990 and one in the mid-2000s. During the same period, Nevada experienced only one bubble, in the mid-2000s, and house prices in Texas and West Virginia closely followed their long-term trends the entire time. Because house prices tend to grow in percent increments, i.e., exponentially, I have chosen a logarithmic *y* axis in Figure \@ref(fig:hpi-trends). The straight lines correspond to a 4.7% annual price increase in California and a 2.8% annual price increase each in Nevada, Texas, and West Virginia.
(ref:hpi-trends) Freddie Mac House Price Index from 1980 through 2017, for four selected states (California, Nevada, Texas, and West Virginia). The House Price Index is a unitless number that tracks relative house prices in the chosen geographic region over time. The index is scaled arbitrarily such that it equals 100 in December of the year 2000. The blue lines show the monthly fluctuations in the index and the straight gray lines show the long-term price trends in the respective states. Note that the *y* axes are logarithmic, so that the straight gray lines represent consistent exponential growth. Data source: Freddie Mac House Prices Index
```{r hpi-trends, fig.width = 5*6/4.2, fig.cap = '(ref:hpi-trends)'}
hpi_trends <- house_prices %>%
filter(year(date) >= 1980) %>%
filter(state %in% c("California", "Nevada", "West Virginia", "Texas")) %>%
mutate(
date_numeric = as.numeric(date),
hpi = house_price_index,
log_hpi = log(hpi)
) %>%
group_by(state) %>%
mutate(
hpi_trend = {
coefs <- coef(lm(log_hpi ~ date_numeric))
exp(coefs[1] + coefs[2]*date_numeric)
},
hpi_detrended = hpi/hpi_trend
)
ggplot(hpi_trends, aes(date, hpi)) +
geom_line(aes(y = hpi_trend), color = "grey50", size = 0.4) +
geom_line(color = "#0072B2", size = 0.75) +
scale_x_date(name = NULL) +
scale_y_log10(name = "House Price Index (Dec. 2000 = 100)") +
facet_wrap(~state, scales = "free_x") +
theme_dviz_hgrid() +
theme(
strip.text = element_text(size = 12),
strip.background = element_rect(fill = "grey85"),
axis.line.x = element_line(color = "grey50"),
axis.ticks.x = element_line(color = "grey50"),
axis.ticks.y = element_blank(),
axis.text.y = element_text(margin = margin(0, 0, 0, 0))
)
```
We detrend housing prices by dividing the actual price index at each time point by the respective value in the long-term trend. Visually, this division will look like we are subtracting the gray lines from the blue lines in Figure \@ref(fig:hpi-trends), because a division of the untransformed values is equivalent to a subtraction of the log-transformed values. The resulting detrended house prices show the housing bubbles more clearly (Figure \@ref(fig:hpi-detrended)), as the detrending emphasizes the unexpected movements in a time series. For example, in the original time series, the decline in home prices in California from 1990 to about 1998 looks modest (Figure \@ref(fig:hpi-trends)). However, during that same time period, on the basis of the long-term trend we would have expected prices to rise. Relative to the expected rise the drop in prices was substantial, amounting to 25% at the lowest point (Figure \@ref(fig:hpi-detrended)).
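
Written out, with $y$ denoting the actual index and $\hat{y}$ the corresponding value of the long-term trend, the equivalence is simply
$$
\log\left(\frac{y}{\hat{y}}\right) = \log(y) - \log(\hat{y}),
$$
so dividing by the trend and plotting the result on a logarithmic axis amounts to subtracting the trend from the log-transformed index.
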
(ref:hpi-detrended) Detrended version of the Freddie Mac House Price Index shown in Figure \@ref(fig:hpi-trends). The detrended index was calculated by dividing the actual index (blue lines in Figure \@ref(fig:hpi-trends)) by the expected value based on the long-term trend (straight gray lines in Figure \@ref(fig:hpi-trends)). This visualization shows that California experienced two housing bubbles, around 1990 and in the mid-2000s, identifiable from a rapid rise and subsequent decline in the actual housing prices relative to what would have been expected from the long-term trend. Similarly, Nevada experienced one housing bubble, in the mid-2000s, and neither Texas nor West Virginia experienced much of a bubble at all. Data source: Freddie Mac House Prices Index
```{r hpi-detrended, fig.width = 5*6/4.2, fig.cap = '(ref:hpi-detrended)'}
ggplot(hpi_trends, aes(date, hpi_detrended)) +
geom_hline(yintercept = 1, color = "grey50", size = 0.4) +
geom_line(color = "#0072B2", size = 0.75) +
scale_x_date(name = NULL) +
scale_y_log10(
name = "House Price Index (detrended)",
breaks = c(0.752, 1, 1.33, 1.77),
labels = c("0.75", "1.00", "1.33", "1.77")
) +
facet_wrap(~state, scales = "free_x") +
theme_dviz_hgrid() +
theme(
strip.text = element_text(size = 12),
strip.background = element_rect(fill = "grey85"),
axis.line.x = element_line(color = "grey50"),
axis.ticks.x = element_line(color = "grey50"),
axis.ticks.y = element_blank(),
axis.text.y = element_text(margin = margin(0, 0, 0, 0))
)
```
Beyond simple detrending, we can also separate a time series into multiple distinct components, such that their sum recovers the original time series. In general, in addition to a long-term trend, there are three distinct components that may shape a time series. First, there is random noise, which causes small, erratic movements up and down. This noise is visible in all the time series shown in this chapter, but perhaps most clearly in Figure \@ref(fig:biorxiv-logscale). Second, there can be unique external events that leave their mark in the time series, such as the distinct housing bubbles seen in Figure \@ref(fig:hpi-trends). Third, there can be cyclical variations. For example, outside temperatures show daily cyclical variations. The highest temperatures are reached in the early afternoon and the lowest temperatures in the early morning. Outside temperatures also show yearly cyclical variations. They tend to rise in the spring, reach their maximum in the summer, and then decline in fall and reach their minimum in the winter (Figure \@ref(fig:temperature-normals-Houston)).
To demonstrate the concept of distinct time-series components, I will here decompose the Keeling curve, which shows changes in CO<sub>2</sub> abundance over time (Figure \@ref(fig:keeling-curve)). CO<sub>2</sub> is measured in parts per million (ppm). We see a long-term increase in CO<sub>2</sub> abundance that is slightly faster than linear, from below 325 ppm in the 1960s to above 400 ppm in the second decade of the 21st century (Figure \@ref(fig:keeling-curve)). CO<sub>2</sub> abundance also fluctuates annually, following a consistent up-and-down pattern overlaid on top of the overall increase. The annual fluctuations are driven by plant growth in the northern hemisphere. Plants consume CO<sub>2</sub> during photosynthesis. Because most of the globe's land masses are located in the northern hemisphere, and plant growth is most active in the spring and summer, we see an annual global decline in atmospheric CO<sub>2</sub> that coincides with the summer months in the northern hemisphere.
(ref:keeling-curve) The Keeling curve. The Keeling curve shows the change of CO<sub>2</sub> abundance in the atmosphere over time. Since 1958, CO<sub>2</sub> abundance has been continuously monitored at the Mauna Loa Observatory in Hawaii, initially under the direction of Charles Keeling. Shown here are monthly average CO<sub>2</sub> readings, expressed in parts per million (ppm). The CO<sub>2</sub> readings fluctuate annually with the seasons but show a consistent long-term trend of increase. Data source: Dr. Pieter Tans, NOAA/ESRL, and Dr. Ralph Keeling, Scripps Institution of Oceanography
```{r keeling-curve, fig.cap = '(ref:keeling-curve)'}
# use complete years only
ggplot(CO2, aes(date_dec, co2_interp)) +
geom_line(color = "#0072B2", size = 0.6) +
scale_y_continuous(
limits = c(295, 418),
breaks = c(300, 325, 350, 375, 400),
name = parse(text = "`CO`[2]*` concentration (ppm)`"),
expand = c(0, 0)
) +
scale_x_continuous(
limits = c(1958, 2019),
name = NULL,
breaks = c(1960, 1970, 1980, 1990, 2000, 2010),
labels = c("", "1970", "", "1990", "", "2010"),
expand = c(0, 0)
) +
theme_dviz_grid()
```
We can decompose the Keeling curve into its long-term trend, seasonal fluctuations, and remainder (Figure \@ref(fig:keeling-curve-decomposition)). The specific method I am using here is called STL (Seasonal decomposition of Time series by LOESS, @Cleveland_et_al_1990), but there are many other methods that achieve similar goals. The decomposition shows that over the last three decades, CO<sub>2</sub> abundance has increased by over 50 ppm. By comparison, seasonal fluctuations amount to less than 8 ppm (they never cause an increase or a decrease of more than 4 ppm relative to the long-term trend), and the remainder amounts to less than 1.6 ppm (Figure \@ref(fig:keeling-curve-decomposition)). The remainder is the difference between the actual readings and the sum of the long-term trend and the seasonal fluctuations, and here it corresponds to random noise in the monthly CO<sub>2</sub> readings. More generally, however, the remainder could also capture unique external events. For example, if a massive volcano eruption released substantial amounts of CO<sub>2</sub>, such an event might be visible as a sudden spike in the remainder. Figure \@ref(fig:keeling-curve-decomposition) shows that no such unique external events have had a major effect on the Keeling curve in recent decades.
(ref:keeling-curve-decomposition) Time-series decomposition of the Keeling curve, showing the monthly average (as in Figure \@ref(fig:keeling-curve)), the long-term trend, seasonal fluctuations, and the remainder. The remainder is the difference between the actual readings and the sum of the long-term trend and the seasonal fluctuations, and it represents random noise. I have zoomed into the most recent 30 years of data to more clearly show the shape of the annual fluctuations. Data source: Dr. Pieter Tans, NOAA/ESRL, and Dr. Ralph Keeling, Scripps Institution of Oceanography
```{r keeling-curve-decomposition, fig.width = 5*6/4.2, fig.asp = 0.9, fig.cap = '(ref:keeling-curve-decomposition)'}
# use complete years only
CO2_complete <- filter(CO2, year >= 1959, year < 2018)
# convert to time series object
CO2_ts <- ts(data = CO2_complete$co2_interp, start = 1959, end = c(2017, 12), frequency = 12)
# detrend via STL method
CO2_stl <- stl(CO2_ts, s.window = 7)
CO2_detrended <- mutate(
CO2_complete,
`monthly average` = co2_interp,
`seasonal fluctuations` = t(CO2_stl$time.series)[1, ],
`long-term trend` = t(CO2_stl$time.series)[2, ],
remainder = t(CO2_stl$time.series)[3, ]
)
facet_labels <- c("monthly average", "long-term trend", "seasonal fluctuations", "remainder")
CO2_detrended %>%
select(date_dec, `monthly average`, `seasonal fluctuations`, `long-term trend`, remainder) %>%
gather(variable, value, -date_dec) %>%
mutate(variable = factor(variable, levels = facet_labels)) %>%
filter(date_dec >= 1989) %>%
ggplot(aes(date_dec, value)) +
geom_line(color = "#0072B2", size = 0.6) +
geom_point(
data = data.frame(
variable = factor(
rep(facet_labels, each = 2),
levels = facet_labels
),
x = 1990,
#y = c(295, 424, 295, 424, -4.1, 4.3, -.81, .85)
y = c(324, 419, 324, 419, -4.1, 4.3, -.81, .85)
),
aes(x, y),
color = NA,
na.rm = TRUE
) +
scale_y_continuous(
name = parse(text = "`CO`[2]*` concentration (ppm)`"),
expand = c(0, 0)
) +
scale_x_continuous(
limits = c(1989, 2018.2),
name = NULL,
breaks = c(1990, 1995, 2000, 2005, 2010, 2015),
labels = c("1990", "", "2000", "", "2010", ""),
expand = c(0, 0)
) +
facet_wrap(facets = vars(variable), ncol = 1, scales = "free") +
theme_dviz_grid() +
theme(
plot.margin = margin(3, 1.5, 3, 1.5),
strip.text = element_text(size = 12)
)
```