Can Symptoms Surveys Improve COVID-19 Forecasts?

#symptom surveys #forecasting #COVIDcast #R
Can Symptoms Surveys Improve COVID-19 Forecasts?

Ryan Tibshirani

Outline

    In our past two posts, we wrote about the COVID-19 symptom surveys that Delphi runs through Facebook and Google. These surveys have asked millions of people in the United States whether they (or people they know) are experiencing COVID-like symptoms, allowing us to calculate a “% CLI-in-community” signal for counties across the United States: an estimate of the percentage of people who know someone who is currently sick with COVID-like illness. Because these surveys run daily and aren’t subject to the reporting delays and lag that can affect other data, such as COVID test results, they promise to be a valuable tool to monitor the spread of COVID-19. This post offers a deeper dive into empirical analysis than our past posts about the surveys, examining whether the % CLI-in-community indicators from our two surveys can be used to improve the accuracy of short-term forecasts of county-level COVID-19 case rates.

    Forecasting has long been a primary initiative of the Delphi research group (in the past for flu, and currently for COVID-19). Each week since mid-July we’ve been submitting forecasts to the COVID Forecast Hub, which serves as the official data source for the CDC’s communications on COVID-19 forecasts.

    At the outset, we should state that this post is neither a report on Delphi’s current COVID-19 forecasters nor an authoritative take on cutting-edge COVID-19 forecasting. Instead, our purpose here is to study the Facebook and Google % CLI-in-community signals, and demonstrate their value, when used as features, to add predictive power beyond what we can achieve with (fairly simple) time series models trained on case rates alone. In a future blog post, we’ll follow up with details on our “production” forecasters.

    While we focus here on forecasting, there are a number of other important prediction problems that arise as part of the pandemic response. For example, we also work on hotspot detection, where the goal is to predict whether case rates will rise significantly, rather than to predict the future case rates directly, as in forecasting. To motivate why we might be optimistic about the utility of incorporating our survey signals into forecasting or hotspot detection models, you can check out our previous exploratory investigations, which suggested that they can serve as early indicators of COVID-19 activity. Stay tuned to the Delphi blog for a post on hotspot detection soon.

    Next we explain the forecasting setup and results in detail. Some parts may get a bit technical, but we’ve tried to make the post accessible enough that you can catch the main points even if you skip the technical details.

    Problem Setup

    Formally, our goal here is to predict county-level COVID-19 case incidence rates, 1 and 2 weeks ahead. Specifically, we wish to predict the number of new COVID-19 cases per capita, both over the next 1-7 days and over the next 8-14 days. One convenient (and equivalent) way to phrase this problem is to predict the smoothed COVID-19 case incidence rate 7 days ahead and 14 days ahead, where the smoothing is performed via a 7-day trailing average. We restrict our attention to the 440 counties that had at least 200 confirmed cases by May 14, 2020 (the end of the Google survey data) and in which both the Facebook and Google % CLI-in-community signals are available (there were 604 counties in total with at least 200 confirmed cases by May 14, and we dropped 164 of them due to a lack of Facebook or Google survey data).

    To fix notation, let \(Y_{\ell,t}\) denote the smoothed COVID-19 case incidence rate for location (county) \(\ell\) and time (day) \(t\). Let \(F_{\ell,t}\) and \(G_{\ell,t}\) denote the Facebook and Google % CLI-in-community signals, respectively, for location \(\ell\) and time \(t\). (We rescale all these signals from their given values in our API so that they are true proportions: between 0 and 1.) We evaluate the following four models:

    \[ \begin{aligned} h(Y_{\ell,t+d}) &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) \\ h(Y_{\ell,t+d}) &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) + \sum_{j=0}^2 \gamma_j h(F_{\ell,t-7j}) \\ h(Y_{\ell,t+d}) &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) + \sum_{j=0}^2 \gamma_j h(G_{\ell,t-7j}) \\ h(Y_{\ell,t+d}) &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) + \sum_{j=0}^2 \gamma_j h(F_{\ell,t-7j}) + \sum_{j=0}^2 \tau_j h(G_{\ell,t-7j}). \end{aligned} \]

    Here \(d=7\) or \(d=14\), depending on the target value (number of days we predict ahead), and \(h\) is a transformation to be specified later.

    Informally, the first model, which we’ll call the “Cases” model, bases its predictions of future case rates on the following three features: current COVID-19 case rates, and those 1 and 2 weeks back. The second model, “Cases + Facebook”, additionally incorporates the current Facebook signal, and the Facebook signal from 1 and 2 weeks back. The third model, “Cases + Google”, is exactly the same but substitutes the Google signal instead of the Facebook one. Finally, the fourth model, “Cases + Facebook + Google”, uses both Facebook and Google signals. For each model, in order to make a forecast at time \(t_0\) (to predict case rates at time \(t_0+d\)), we fit a linear model using least absolute deviations (LAD) regression, training over all locations \(\ell\) (all 440 counties), and all time \(t\) that are within the most recent 14 days of data available up to and including time \(t_0\).

    Forecasts are transformed back to the original scale (we apply \(h^{-1}\) to the predictions from the fitted LAD model), and denoted \(\hat{Y}_{\ell,t_0+d}\). For an error metric, we consider scaled absolute error (or just scaled error for short):

    \[ \frac{|\hat{Y}_{\ell,t_0+d} - Y_{\ell,t_0+d}|} {|Y_{\ell,t_0} - Y_{\ell,t_0+d}|}, \]

    where the error in the denominator is the error of the “strawman” model, which for any target always simply predicts the most recent available case rate.

    This normalization helps for two reasons. First, it gives us an interpretable scale, as we can understand the scaled error as a fraction improvement over the strawman’s error (so numbers like 0.8 or 0.9 would be favorable, and numbers like 2 or 5 or 10 would be increasingly disastrous). Second, in our forecasting problem, there turns out to a considerable amount of county-to-county variability in forecasting difficulty, and normalizing by the strawman’s error helps adjust for this (so that the aggregate results aren’t dominated by county-to-county differences).

    Transformations

    We investigated three transformations \(h\): identity, log, and logit (the latter two being common variance-stabilizing transforms for proportions). The results in all three cases were quite similar and the qualitative conclusions don’t change at all (the code below supports all three, so you can check this for yourself). For brevity, we’ll just show the results for the logit transform (actually, a “padded” version \(h(x) = \log\left(\frac{x+a}{1-x+a}\right)\), where the numerator and denominator are pushed away from zero by a small constant, which we took to be \(a=0.01\)).

    Forecasting Code

    The code below marches the forecast date \(t_0\) forward, one day at a time (from April 11 to September 1), fits the four models, makes predictions 7 and 14 days ahead, and records errors. It takes a little while to run1, the culprit being LAD regression: the training sets in our forecasting problem get moderately large (aggregating the data over 440 counties and 14 days results in over 6000 training samples), and at this scale LAD regression is much slower than least squares regression. We ran this R code separately and saved the results in an RData file; you can find this in the same GitHub repo as the Rmd source for this blog post.

    library(covidcast)
    library(dplyr)
    library(tidyr)
    
    #### Functions #####
    
    # Function to append shift values (lags or leads) to data frame
    append_shifts = function(df, shifts) {
      # Make sure that we have a complete record of dates for each geo_value (fill
      # with NAs as necessary)
      df_all = df %>% group_by(geo_value) %>%
        summarize(time_value = seq.Date(as.Date(min(time_value)),
                                        as.Date(max(time_value)),
                                        by = "day")) %>% ungroup()
      df = full_join(df, df_all, by = c("geo_value", "time_value"))
    
      # Group by geo value, sort rows by increasing time
      df = df %>% group_by(geo_value) %>% arrange(time_value)
    
      # Load over shifts, and add lag value or lead value
      for (shift in shifts) {
        fun = ifelse(shift < 0, lag, lead)
        varname = sprintf("value%+d", shift)
        df = mutate(df, !!varname := fun(value, n = abs(shift)))
      }
    
      # Ungroup and return
      return(ungroup(df))
    }
    
    # Some useful functions for transformations
    Log = function(x, a = 0.01) log(x + a)
    Exp = function(y, a = 0.01) exp(y) - a
    Logit = function(x, a = 0.01) log((x + a) / (1 - x + a))
    Sigmd = function(y, a = 0.01) (exp(y) * (1 + a) - a) / (1 + exp(y))
    Id = function(x) x
    
    #### Parameters #####
    
    # Transforms to consider, in what follows
    trans = Logit
    inv_trans = Sigmd
    
    # Rescale factors for our signals: bring them all down to proportions (between
    # 0 and 1)
    rescale_g = 1e-2 # Originally a percentage
    rescale_f = 1e-2 # Originally a percentage
    rescale_c = 1e-5 # Originally a count per 100,000 people
    
    n = 14 # Number of trailing days to use for training set
    lp_solver = "glpk" # LP solver to use in quantile_lasso()
    verbose = TRUE # Print intermediate progress to console?
    
    #### Data #####
    
    # Consider only counties with at least 200 cumulative cases by Google's end
    case_num = 200
    geo_values = covidcast_signal("jhu-csse", "confirmed_cumulative_num",
                                  "2020-05-14", "2020-05-14") %>%
      filter(value >= case_num) %>% pull(geo_value)
    
    # Fetch county-level Google and Facebook % CLI-in-community signals, and JHU
    # confirmed case incidence proportion
    start_day = "2020-04-11"
    end_day = "2020-09-01"
    g = covidcast_signal("google-survey", "smoothed_cli") %>%
      filter(geo_value %in% geo_values) %>%
      select(geo_value, time_value, value)
    f = covidcast_signal("fb-survey", "smoothed_hh_cmnty_cli",
                         start_day, end_day) %>%
      filter(geo_value %in% geo_values) %>%
      select(geo_value, time_value, value)
    c = covidcast_signal("jhu-csse", "confirmed_7dav_incidence_prop",
                         start_day, end_day) %>%
      filter(geo_value %in% geo_values) %>%
      select(geo_value, time_value, value)
    
    # Find "complete" counties, present in all three data signals at all times
    geo_values_complete = intersect(intersect(g$geo_value, f$geo_value),
                                    c$geo_value)
    
    # Filter to complete counties, transform the signals, append 1-2 week lags to
    # all three, and also 1-2 week leads to case rates
    lags = 1:2 * -7
    leads = 1:2 * 7
    g = g %>% filter(geo_value %in% geo_values_complete) %>%
      mutate(value = trans(value * rescale_g)) %>%
      append_shifts(shifts = lags)
    f = f %>% filter(geo_value %in% geo_values_complete) %>%
      mutate(value = trans(value * rescale_f)) %>%
      append_shifts(shifts = lags)
    c = c %>% filter(geo_value %in% geo_values_complete) %>%
      mutate(value = trans(value * rescale_c)) %>%
      append_shifts(shifts = c(lags, leads))
    
    # Rename columns
    colnames(g) = sub("^value", "goog", colnames(g))
    colnames(f) = sub("^value", "fb", colnames(f))
    colnames(c) = sub("^value", "case", colnames(c))
    
    # Make one big matrix by joining these three data frames
    z = full_join(full_join(g, f, by = c("geo_value", "time_value")),
                  c, by = c("geo_value", "time_value"))
    
    ##### Analysis #####
    
    # Use quantgen for LAD regression (this package supports quantile regression and
    # more; you can find it on GitHub here: https://github.com/ryantibs/quantgen)
    library(quantgen)
    
    res_list = vector("list", length = length(leads))
    
    # Loop over lead, forecast dates, build models and record errors (warning: this
    # computation takes a while)
    for (i in 1:length(leads)) {
      lead = leads[i]; if (verbose) cat("***", lead, "***\n")
    
      # Create a data frame to store our forecast results. Code below populates its
      # rows in a way that breaks from typical dplyr operations, done for efficiency
      res_list[[i]] = z %>%
        filter(between(time_value, as.Date(start_day) - min(lags) + lead,
                       as.Date(end_day) - lead)) %>%
        select(geo_value, time_value) %>%
        mutate(err0 = as.double(NA), err1 = as.double(NA), err2 = as.double(NA),
               err3 = as.double(NA), err4 = as.double(NA), lead = lead)
      valid_dates = unique(res_list[[i]]$time_value)
    
      for (k in 1:length(valid_dates)) {
        date = valid_dates[k]; if (verbose) cat(format(date), "... ")
    
        # Filter down to training set and test set
        z_tr = z %>% filter(between(time_value, date - lead - n, date - lead))
        z_te = z %>% filter(time_value == date)
        inds = which(res_list[[i]]$time_value == date)
    
        # Create training and test responses
        y_tr = z_tr %>% pull(paste0("case+", lead))
        y_te = z_te %>% pull(paste0("case+", lead))
    
        # Strawman model
        if (verbose) cat("0")
        y_hat = z_te %>% pull(case)
        res_list[[i]][inds,]$err0 = abs(inv_trans(y_hat) - inv_trans(y_te))
    
        # Cases only model
        if (verbose) cat("1")
        x_tr_case = z_tr %>% select(starts_with("case") & !contains("+"))
        x_te_case = z_te %>% select(starts_with("case") & !contains("+"))
        x_tr = x_tr_case; x_te = x_te_case # For symmetry wrt what follows
        ok = complete.cases(x_tr, y_tr)
        if (sum(ok) > 0) {
          obj = quantile_lasso(as.matrix(x_tr[ok,]), y_tr[ok], tau = 0.5,
                               lambda = 0, lp_solver = lp_solver)
          y_hat = as.numeric(predict(obj, newx = as.matrix(x_te)))
          res_list[[i]][inds,]$err1 = abs(inv_trans(y_hat) - inv_trans(y_te))
        }
    
        # Cases and Facebook model
        if (verbose) cat("2")
        x_tr_fb = z_tr %>% select(starts_with("fb"))
        x_te_fb = z_te %>% select(starts_with("fb"))
        x_tr = cbind(x_tr_case, x_tr_fb)
        x_te = cbind(x_te_case, x_te_fb)
        ok = complete.cases(x_tr, y_tr)
        if (sum(ok) > 0) {
          obj = quantile_lasso(as.matrix(x_tr[ok,]), y_tr[ok], tau = 0.5,
                               lambda = 0, lp_solver = lp_solver)
          y_hat = as.numeric(predict(obj, newx = as.matrix(x_te)))
          res_list[[i]][inds,]$err2 = abs(inv_trans(y_hat) - inv_trans(y_te))
        }
    
        # Cases and Google model
        if (verbose) cat("3")
        x_tr_goog = z_tr %>% select(starts_with("goog"))
        x_te_goog = z_te %>% select(starts_with("goog"))
        x_tr = cbind(x_tr_case, x_tr_goog)
        x_te = cbind(x_te_case, x_te_goog)
        ok = complete.cases(x_tr, y_tr)
        if (sum(ok) > 0) {
          obj = quantile_lasso(as.matrix(x_tr[ok,]), y_tr[ok], tau = 0.5,
                               lambda = 0, lp_solver = lp_solver)
          y_hat = as.numeric(predict(obj, newx = as.matrix(x_te)))
          res_list[[i]][inds,]$err3 = abs(inv_trans(y_hat) - inv_trans(y_te))
        }
    
        # Cases, Facebook, and Google model
        if (verbose) cat("4\n")
        x_tr = cbind(x_tr_case, x_tr_fb, x_tr_goog)
        x_te = cbind(x_te_case, x_te_fb, x_te_goog)
        ok = complete.cases(x_tr, y_tr)
        if (sum(ok) > 0) {
          obj = quantile_lasso(as.matrix(x_tr[ok,]), y_tr[ok], tau = 0.5,
                               lambda = 0, lp_solver = lp_solver)
          y_hat = as.numeric(predict(obj, newx = as.matrix(x_te)))
          res_list[[i]][inds,]$err4 = abs(inv_trans(y_hat) - inv_trans(y_te))
        }
      }
    }
    
    # Bind results over different leads into one big data frame, and save
    res = do.call(rbind, res_list)
    save(list = ls(), file = "demo.rda")

    Results: All Four Models

    We first compare the results across all four models. For this analysis, we filter down to common forecast dates available for the four models (to set an even footing for the comparison), which ends up being May 6 through May 14 for 7-day-ahead forecasts, and only May 13 through May 14 for 14-day-ahead forecasts. (The reason for this shortened period: we paused running the Google survey on May 14 so its data ends there, but as we explained in our last post, we plan to bring it back later this fall.) Hence we skip studying the 14-day-ahead forecasts results in this four-way model discussion, as they’re only based on 2 days of test data.

    Below we compute and print the median scaled errors for each of the four models over the 9-day test period (recall that the scaled error is the absolute error of the model’s forecast relative to that of the strawman; and each test day actually comprises 440 forecasts, over the 440 counties being considered). We can see that adding either or both of the survey signals improves on the median scaled error of the model that uses cases only, with the biggest gain achieved by the “Cases + Google” model. We can also see that the median scaled errors are all close to 1 (with all but that from “Cases + Google” model exceeding 1), which speaks to the difficulty of the forecasting problem.

    load("forecast-demo/demo.rda")
    library(dplyr)
    library(tidyr)
    library(ggplot2)
    
    model_names = c("Cases", "Cases + Facebook", "Cases + Google",
                    "Cases + Facebook + Google")
    
    # Restrict to common period for all 4 models, then calculate the scaled errors
    # for each model, that is, the error relative to the strawman's error
    res_all4 = res %>%
      drop_na() %>%                                       # Restrict to common time
      mutate(err1 = err1 / err0, err2 = err2 / err0,      # Compute relative error
             err3 = err3 / err0, err4 = err4 / err0) %>%  # to strawman model
      mutate(dif12 = err1 - err2, dif13 = err1 - err3,    # Compute differences
             dif14 = err1 - err4) %>%                     # relative to cases model
      ungroup() %>%
      select(-err0)
    
    # Calculate and print median errors, for all 4 models, and just 7 days ahead
    res_err4 = res_all4 %>%
      select(-starts_with("dif")) %>%
      pivot_longer(names_to = "model", values_to = "err",
                   cols = -c(geo_value, time_value, lead)) %>%
      mutate(lead = factor(lead, labels = paste(leads, "days ahead")),
             model = factor(model, labels = model_names))
    
    knitr::kable(res_err4 %>%
                   group_by(model, lead) %>%
                   summarize(err = median(err), n = length(unique(time_value))) %>%
                   arrange(lead) %>% ungroup() %>%
                   rename("Model" = model, "Median scaled error" = err,
                          "Target" = lead, "Test days" = n) %>%
                   filter(Target == "7 days ahead"),
                 caption = paste("Test period:", min(res_err4$time_value), "to",
                                 max(res_err4$time_value)),
                 format = "html", table.attr = "style='width:70%;'")
    Table 1: Test period: 2020-05-06 to 2020-05-14
    Model Target Median scaled error Test days
    Cases 7 days ahead 1.0393350 9
    Cases + Facebook 7 days ahead 1.0111709 9
    Cases + Google 7 days ahead 0.9838905 9
    Cases + Facebook + Google 7 days ahead 1.0168668 9

    Are these differences in median scaled errors significant? It’s hard to say, but some basic hypothesis testing suggests that they probably are: below we conduct a sign test2 for whether the difference in the “Cases” model’s scaled error and each other model’s scaled error is centered at zero. The sign test is run on the 9 test days x 440 counties = 3960 pairs of scaled errors. The p-values from the “Cases” versus “Cases + Facebook” and the “Cases” versus “Cases + Google” tests are tiny; the p-value from the “Cases” versus “Cases + Facebook + Google” test is much bigger but still below 0.01.

    # Compute p-values using the sign test against a one-sided alternative, for
    # all models, and just 7 days ahead
    res_dif4 = res_all4 %>%
      select(-starts_with("err")) %>%
      pivot_longer(names_to = "model", values_to = "dif",
                   cols = -c(geo_value, time_value, lead)) %>%
      mutate(lead = factor(lead, labels = paste(leads, "days ahead")),
             model = factor(model,
                            labels = c("Cases vs Cases + Facebook",
                                       "Cases vs Cases + Google",
                                       "Cases vs Cases + Facebook + Google")))
    
    knitr::kable(res_dif4 %>%
                   group_by(model, lead) %>%
                   summarize(p = binom.test(x = sum(dif > 0, na.rm = TRUE),
                                            n = n(), alt = "greater")$p.val) %>%
                   ungroup() %>% filter(lead == "7 days ahead") %>%
                   rename("Comparison" = model, "Target" = lead, "P-value" = p),
                 format = "html", table.attr = "style='width:50%;'")
    Comparison Target P-value
    Cases vs Cases + Facebook 7 days ahead 0.0000048
    Cases vs Cases + Google 7 days ahead 0.0000006
    Cases vs Cases + Facebook + Google 7 days ahead 0.0088436

    We should read these with a grain of salt: the sign test here assumes independence of observations, which clearly can’t be true, given the spatiotemporal structure of our forecasting problem. To mitigate the dependence across time (which intuitively seems to matter more than that across space), we recomputed these tests in a stratified way, where for each day we run a sign test on the scaled errors between two models over all 440 counties. The results are plotted as histograms below; the “Cases + Facebook” and “Cases + Google” models appear to deliver some decently small p-values, but the story is not as clear with the “Cases + Facebook + Google” model.

    # Red, blue (similar to ggplot defaults), then yellow
    ggplot_colors = c("#FC4E07", "#00AFBB", "#E7B800")
    
    ggplot(res_dif4 %>%
             group_by(model, lead, time_value) %>%
             summarize(p = binom.test(x = sum(dif > 0, na.rm = TRUE),
                                      n = n(), alt = "greater")$p.val) %>%
             ungroup() %>% filter(lead == "7 days ahead"), aes(p)) +
      geom_histogram(aes(color = model, fill = model), alpha = 0.4) +
      scale_color_manual(values = ggplot_colors) +
      scale_fill_manual(values = ggplot_colors) +
      facet_wrap(vars(lead, model)) +
      labs(x = "P-value", y = "Count") +
      theme_bw() + theme(legend.pos = "none")

    Results: First Two Models

    Next we focus on comparing results between the “Cases” and “Cases + Facebook” models only. Restricting to a common available forecast dates yields a much longer test period, May 6 through August 25 for 7-day-ahead forecasts, and May 13 through August 18 for 14-day-ahead forecasts. The median scaled errors over the test period are computed and reported below. Now we see a decent improvement in median scaled error for the “Cases + Facebook” model, and this is true for both 7-day-ahead and 14-day-ahead forecasts.

    # Restrict to common period for just models 1 and 2, then calculate the scaled
    # errors, that is, the error relative to the strawman's error
    res_all2 = res %>%
      select(-c(err3, err4)) %>%
      drop_na() %>%                                       # Restrict to common time
      mutate(err1 = err1 / err0, err2 = err2 / err0) %>%  # Compute relative error
                                                          # to strawman model
      mutate(dif12 = err1 - err2) %>%                     # Compute differences
                                                          # relative to cases model
      ungroup() %>%
      select(-err0)
    
    # Calculate and print median errors, for just models 1 and 2, and both 7 and 14
    # days ahead
    res_err2 = res_all2 %>%
      select(-starts_with("dif")) %>%
      pivot_longer(names_to = "model", values_to = "err",
                   cols = -c(geo_value, time_value, lead)) %>%
      mutate(lead = factor(lead, labels = paste(leads, "days ahead")),
             model = factor(model, labels = model_names[1:2]))
    
    knitr::kable(res_err2 %>%
                   select(-starts_with("dif")) %>%
                   group_by(model, lead) %>%
                   summarize(err = median(err), n = length(unique(time_value))) %>%
                   arrange(lead) %>% ungroup() %>%
                   rename("Model" = model, "Median scaled error" = err,
                          "Target" = lead, "Test days" = n),
                 caption = paste("Test period:", min(res_err2$time_value), "to",
                                 max(res_err2$time_value)),
                 format = "html", table.attr = "style='width:70%;'")
    Table 2: Test period: 2020-05-06 to 2020-08-25
    Model Target Median scaled error Test days
    Cases 7 days ahead 0.9579517 112
    Cases + Facebook 7 days ahead 0.9397304 112
    Cases 14 days ahead 1.0011765 98
    Cases + Facebook 14 days ahead 0.9753763 98

    Thanks to the extended length of the test period, we can also “unravel” these median scaled errors over time and plot their trajectories, as we do below, with the left plot concerning 7-day-ahead forecasts, and the right 14-day-ahead forecasts. These plots reveal something at once interesting and bothersome: the median scaled errors are quite volatile over time, and for some periods in July, forecasting became much “harder”, with the scaled errors reaching above 1.25 for 7-day-ahead forecasts, and above 1.5 for 14-day-ahead forecasts. Furthermore, towards the positive, we can see a clear visual difference between the median scaled errors from the “Cases + Facebook” model in red and the “Cases” model in black. The former appears to be below the latter pretty consistently over time, with the possible exception of periods where forecasting becomes “hard” and the scaled errors shoot above 1.

    # Plot median errors as a function of time, for models 1 and 2, and both 7 and
    # 14 days ahead
    ggplot(res_err2 %>%
             group_by(model, lead, time_value) %>%
             summarize(err = median(err)) %>% ungroup(),
           aes(x = time_value, y = err)) +
      geom_line(aes(color = model)) +
      scale_color_manual(values = c("black", ggplot_colors)) +
      geom_hline(yintercept = 1, linetype = 2, color = "gray") +
      facet_wrap(vars(lead)) +
      labs(x = "Date", y = "Median scaled error") +
      theme_bw() + theme(legend.pos = "bottom", legend.title = element_blank())

    Again, basic hypothesis testing suggests that the results we’re seeing here are likely significant, though it’s hard to say definitively given the complicated dependence structure present in the data. Below we perform a sign test for whether the difference in scaled errors from the “Cases” and “Cases + Facebook” models is centered at zero. Given the large sample size: 112 test days for 7-day-ahead forecasts and 98 test days for 14-day-ahead forecasts (times 440 counties for each day), the p-values are basically zero.

    # Compute p-values using the sign test against a one-sided alternative, just
    # for models 1 and 2, and both 7 and 14 days ahead
    res_dif2 = res_all2 %>%
      select(-starts_with("err")) %>%
      pivot_longer(names_to = "model", values_to = "dif",
                   cols = -c(geo_value, time_value, lead)) %>%
      mutate(lead = factor(lead, labels = paste(leads, "days ahead")),
             model = factor(model, labels = "Cases > Cases + Facebook"))
    
    knitr::kable(res_dif2 %>%
                   group_by(model, lead) %>%
                   summarize(p = binom.test(x = sum(dif > 0, na.rm = TRUE),
                                            n = n(), alt = "greater")$p.val) %>%
                   ungroup() %>%
                   rename("Comparison" = model, "Target" = lead, "P-value" = p),
                 format = "html", table.attr = "style='width:50%;'")
    Comparison Target P-value
    Cases > Cases + Facebook 7 days ahead 0
    Cases > Cases + Facebook 14 days ahead 0

    Once we stratify and recompute p-values by forecast date, as shown in the histograms below, the bulk of p-values are still quite small.

    ggplot(res_dif2 %>%
             group_by(model, lead, time_value) %>%
             summarize(p = binom.test(x = sum(dif > 0, na.rm = TRUE),
                                      n = n(), alt = "greater")$p.val) %>%
             ungroup(), aes(p)) +
      geom_histogram(aes(color = model, fill = model), alpha = 0.4) +
      scale_color_manual(values = ggplot_colors) +
      scale_fill_manual(values = ggplot_colors) +
      facet_wrap(vars(lead, model)) +
      labs(x = "P-value", y = "Count") +
      theme_bw() + theme(legend.pos = "none")

    Varying the Number of Days Ahead

    Hypothesis tests (like the sign tests conducted above) tell us whether the differences in errors (between the forecasters) are statistically significant, but not about their practical significance. For example, for 7-day-ahead forecasts, what does an improvement of 0.018 units on the scaled error scale really mean, when comparing the “Cases + Facebook” model to the “Cases” model (over the test period May 6 through August 25)? Is this a meaningful gain?

    To answer questions like this, we can look at the way that the median scaled errors behave as a function of the number of days ahead at which we’re making the forecasts. Previously, we considered forecasting case rates just 7 and 14 days ahead; now we systematically examine 5, 6, 7, etc., through 20 days ahead. As before, we ran the code for this separately and saved the results in an RData file, which you can find in the same GitHub repo as the Rmd source for this blog post. (It’s exactly the same code as that above, but with leads = 5:20.)

    Below, we compute and plot the median scaled errors for the “Cases” and “Cases + Facebook” models for different number of days ahead for the forecast target. This is done over all forecast dates common to the two models (May 6 through August 27, or earlier—the end date gets decremented each time we increase the number of days ahead). A first glance shows that the “Cases + Facebook” model, in red, gives better median scaled errors at all ahead values; and the vertical gap between the two curves is consistently in the range of what we were seeing before (for 7 and 14 days ahead), around 0.02 units or more on the scaled error scale.

    load("forecast-demo/demo-extended.rda")
    
    # Compute and plot median errors as function of number of days ahead
    err_by_lead = res %>%
      select(-c(err3, err4)) %>%
      drop_na() %>%                                       # Restrict to common time
      mutate(err1 = err1 / err0, err2 = err2 / err0) %>%  # Compute relative error
                                                          # to strawman model
      ungroup() %>%
      select(-err0) %>%
      pivot_longer(names_to = "model", values_to = "err",
                   cols = -c(geo_value, time_value, lead)) %>%
      mutate(model = factor(model, labels = model_names[1:2])) %>%
      group_by(model, lead) %>%
      summarize(err = median(err)) %>%
      ungroup()
    
    ggplot(err_by_lead, aes(x = lead, y = err)) +
      geom_line(aes(color = model)) +
      geom_point(aes(color = model)) +
      scale_color_manual(values = c("black", ggplot_colors)) +
      geom_hline(yintercept = err_by_lead %>%
                   filter(lead %in% 7, model == "Cases") %>% pull(err),
                 linetype = 2, color = "gray") +
      labs(title = "Forecasting errors by number of days ahead",
           subtitle = sprintf("Over all counties with at least %i cumulative cases",
                              case_num),
           x = "Number of days ahead", y = "Median scaled error") +
      theme_bw() + theme(legend.pos = "bottom", legend.title = element_blank())

    But if we look at it from a different angle, and consider the horizontal gap between the curves, then we can infer something quite a bit more interesting: for 7-day-ahead forecasts, the median scaled error of the “Cases” model (marked by a horizontal gray line) is comparable to that of 12-day-ahead forecasts from the “Cases + Facebook” model. So you could say that using the % CLI-in-community signal from our Facebook survey buys us 5 extra days of lead time for this forecasting problem, which seems pretty nontrivial. Different forecast targets yield different lead times (for 14-day-ahead forecasts, for example, it appears to be more like 3 or 4 days of lead time), but the added value of the survey signal is clear throughout.

    Wrap-Up

    We’ve seen that either of the Facebook or Google % CLI-in-community signals can improve the accuracy of short-term forecasts of county-level COVID-19 case incidence rates. The significance of these improvements is more apparent with the Facebook signal, thanks to the much longer test period available. With either signal, of the magnitude of the improvement offered seems modest but nontrivial, especially because the forecasting problem is so hard in the first place.

    We reiterate that this was just a demo. Our analysis was fairly simple and lacks a few qualities that we’d expect in a truly comprehensive, realistic forecasting analysis. To name three:

    1. The models we considered are simple autoregressive structures from standard time series, and could be improved in various ways (including, considering other relevant dimensions like mobility measures, county health metrics, etc.).

    2. The forecasts we produced are point rather than distributional forecasts (that is, we predict a single number, rather than an entire distribution, for what happens 7 and 14 days ahead). Distributional forecasts portray uncertainty in a transparent way, which is important in practice.

    3. The way we trained our forecast models does not account for data latency and revisions, which are critical issues. For each (retrospective) forecast date \(t_0\), we constructed forecasts by training on data that we fetched from the API today, “as of” the day we wrote this blog post, and not “as of” the forecast date \(t_0\). This matters because nearly all signals are subject to latency (they are only available at some number of days lag) and go through multiple revisions (past data values get updated as time goes on).

    On the flip side, our example here was not that “far away” from being realistic. The models we examined are actually not too different from Delphi’s forecasters “in production”.3 Also, the way we fit LAD regression models in the code extends immediately to multiple quantile regression (just requires changing the parameter tau in the call to quantile_lasso()), which would give us distributional forecasts. And lastly, it’s fairly easy to change the data acquisition step in the code so that data gets pulled “as of” the forecast date (requires specifying the parameter as_of in the call to covidcast_signal()).

    Hopefully these preliminary findings have gotten you excited about the possible uses of our symptom survey data. To get started playing with our data yourself, take a look at our interactive COVIDcast map, or our COVIDcast API, through our R client or Python client. And if you’re feeling adventurous, consider competing in the COVID-19 Symptom Data Challenge, and trying your hand at developing novel analytic approaches to extract insights from our Facebook symptom survey data. We’ve made extensive data aggregate data from our survey (beyond CLI indicators) available for the Challenge, and submissions are due September 29, with finalists eligible for cash prizes. We look forward to seeing how you put our data to use!

    Note: This post was updated on September 25, 2020 to include the section Varying the Number of Days Ahead.


    1. The R package quantgen allows you to choose between Gurobi or GLPK as the underlying LP solver. The default is GLPK, since it’s open source; but if you can use Gurobi (which is free for academic use), then this forecast demo will run much faster.↩︎

    2. As far as nonparametric tests of medians go, Wilcoxon’s signed-rank test (for paired data, as we have here) is more popular, because it tends to be more powerful than the sign test. Applied here, it does indeed give smaller p-values pretty much across the board. However, it assumes symmetry of the distribution in question (in our case, the difference in scaled errors), whereas the sign test does not, and thus we show results from the latter.↩︎

    3. Delphi’s “production” forecasters are still based on relatively simple times series models, though to be clear they’re distributional, and we add a couple of extra layers of complexity on top of standard structures. For short-term forecasts, we’ve found that simple statistical models can be competitive with compartmental models (like SIR and its many variants), even fairly complicated ones. Being statisticians and computer scientists, we find these statistical models are easier to build, debug, and most importantly, calibrate. More on this in a future blog post.↩︎

    Acknowledgements: Delphi’s forecasting effort involves many people from our modeling team, from forecaster design, to implementation, to evaluation. The broader insights on forecasting shared in this post certainly cannot be attributable to Ryan’s work alone, and are a reflection of the work carried out by all these team members.

    Related Posts:

    Ryan Tibshirani is a lead researcher in the Delphi group, and an Associate Professor in the Department of Statistics & Data Science and the Machine Learning Department at CMU. He is also an Amazon Scholar.
    © 2021 Delphi group authors. Text and figures released under CC BY 4.0 ; code under the MIT license.

    Latest Stories