Forecast Evaluation Dashboard

This dashboard was developed by:

  • Chris Scott (Delphi Group, Google Fellow)
  • Kate Harwood (Delphi Group, Google Fellow)
  • Jed Grabman (Delphi Group, Google Fellow)

with the Forecast Evaluation Research Collaborative:

  • Ryan Tibshirani (Delphi Group)
  • Nicholas Reich (Reich Lab)
  • Evan Ray (Reich Lab)
  • Daniel McDonald (Delphi Group)
  • Estee Cramer (Reich Lab)
  • Logan Brooks (Delphi Group)
  • Johannes Bracher (Karlsruhe Institute of Technology)
  • Jacob Bien (Delphi Group)

Forecast data for all states and U.S. territories are supplied by the COVID-19 Forecast Hub.

Forecast Evaluation Research Collaborative

The Forecast Evaluation Research Collaborative was founded by the Delphi Group at Carnegie Mellon University and the Reich Lab at the University of Massachusetts Amherst.

Both groups are funded by the CDC as Centers of Excellence for Influenza and COVID-19 Forecasting. We have partnered on this project to provide a robust set of tools and methods for evaluating the performance of epidemic forecasts.

The collaborative’s mission is to help epidemiological researchers gain insights into the performance of their forecasts, leading to more accurate forecasting of epidemics.

Both groups lead initiatives related to COVID-19 data and forecast curation. The Reich Lab created and maintains the COVID-19 Forecast Hub, a collaborative effort with over 80 groups submitting forecasts to be part of the official CDC COVID-19 ensemble forecast. The Delphi Group created and maintains COVIDcast, a platform for epidemiological surveillance data, and runs the U.S. COVID-19 Trends and Impact Survey in partnership with Facebook.

The Forecast Evaluation Dashboard is a collaborative project, made possible by the 13 pro bono Google.org Fellows who spent 6 months working full-time with the Delphi Group. Google.org is committed to the recovery of lives and communities impacted by COVID-19 and to investing in the science needed to mitigate the damage of future pandemics.

About the Data

Sources

Observed cases and deaths are from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

Observed hospitalizations are from the U.S. Department of Health & Human Services and are the sum of all adult and pediatric COVID-19 hospital admissions.

Forecaster predictions are drawn from the COVID-19 Forecast Hub GitHub repository.

Data for the dashboard are pulled from these sources on Sunday, Monday, and Tuesday each week.

Terms

  • Forecaster: A named model that produces forecasts, e.g., “COVIDhub-ensemble”

  • Forecast: A set of quantile predictions for a specific target variable, epidemiological week, and location

  • Target Variable: What the forecast is predicting, e.g., “weekly incident cases”

  • Epidemiological week (MMWR week): A standardized week that starts on a Sunday. See the CDC definition for additional details.

  • Horizon: The duration of time between when a prediction was made and the end of the corresponding epidemiological week. Following the Reich Lab definition, a forecast has a horizon of 1 week if it was produced no later than the Monday of the epidemiological week it forecasts. Thus, forecasts made 5-11 days before the end of the corresponding epidemiological week have a horizon of 1 week, 12-18 days before have a horizon of 2 weeks, etc.
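
For illustration, here is a minimal sketch in R (illustrative only, not dashboard code) of the weekly horizon implied by this definition, given a forecast date and the Saturday that ends the target epidemiological week:

# Illustrative sketch: weekly horizon implied by the definition above,
# given the forecast date and the Saturday ending the target epidemiological week.
horizon_weeks <- function(forecast_date, target_end_date) {
  days_ahead <- as.integer(as.Date(target_end_date) - as.Date(forecast_date))
  ceiling((days_ahead - 4) / 7)  # 5-11 days ahead -> 1, 12-18 days -> 2, ...
}

horizon_weeks("2021-06-07", "2021-06-12")  # made the Monday of the target week -> 1
horizon_weeks("2021-05-31", "2021-06-12")  # made 12 days ahead -> 2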

Dashboard Inclusion Criteria

A forecast is only included if all the following criteria are met:

  • The target variable is the weekly incidence of either cases or deaths, or the daily incidence of hospitalizations
  • The horizon is no more than 4 weeks ahead
  • The location is a U.S. state, territory, or the nation as a whole
  • All dates are parsable. If a date is not in yyyy/mm/dd format, the forecast may be dropped.
  • The forecast was made on or before the Monday of the relevant week. If multiple versions of a forecast are submitted, only the last forecast that meets the date restriction is included.

How Hospitalization Forecasts Are Processed

Though hospitalizations are forecast on a daily basis, we show hospitalization scores on a weekly basis in the dashboard, in keeping with the case and death scoring and plotting. We only look at forecasts for one target day per week (currently Wednesdays) and calculate the weekly horizons accordingly. Hospitalization horizons are calculated in the following manner (see the sketch after this list):

  • 2 days ahead: Forecast date is on or before the Monday preceding the target date (Wednesday)
  • 9 days ahead: Forecast date is on or before 7 days before the Monday preceding the target date
  • 16 days ahead: Forecast date is on or before 14 days before the Monday preceding the target date
  • 23 days ahead: Forecast date is on or before 21 days before the Monday preceding the target date
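
Here is a minimal sketch in R (illustrative only, not the dashboard's processing code) of the latest admissible forecast date for each of these horizons, assuming the target date is a Wednesday:

# Illustrative sketch: latest admissible forecast date for each hospitalization
# horizon, given a Wednesday target date.
hosp_horizon_cutoffs <- function(target_date) {
  target_date   <- as.Date(target_date)   # assumed to be a Wednesday
  monday_before <- target_date - 2        # the Monday preceding the target date
  data.frame(
    days_ahead           = c(2, 9, 16, 23),
    latest_forecast_date = monday_before - c(0, 7, 14, 21)
  )
}

hosp_horizon_cutoffs("2021-06-09")
#   days_ahead latest_forecast_date
# 1          2           2021-06-07
# 2          9           2021-05-31
# 3         16           2021-05-24
# 4         23           2021-05-17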

Notes on the Data

  • If a forecast does not include an explicit point estimate, the 0.5 quantile is taken as the point estimate for calculating absolute error.
  • The weighted interval score is only shown for forecasts that have predictions for all quantiles (23 quantiles for deaths and hospitalizations, 7 for cases).
  • Totaling over all states and territories does not include nationwide forecasts. To ensure that values are comparable, these totals also exclude any locations that are absent from any file that was submitted by one of the selected forecasters.
  • For scoring, we include revisions of observed values, which means that the scores for forecasts made in the past can change as our understanding of the ground truth changes.
  • The observed data can also be viewed ‘as of’ a certain date, which shows what observed data a forecaster had available when a past forecast was made (but the forecasts are always scored on the latest revision of the observed data).

Accessing the Data

The forecasts and scores are available as RDS files and are uploaded weekly to a publicly accessible AWS bucket.

You can download any of the files from the bucket by appending the file name to the URL https://forecast-eval.s3.us-east-2.amazonaws.com/.

For instance, https://forecast-eval.s3.us-east-2.amazonaws.com/score_cards_nation_cases.rds downloads the scores for nation-level case predictions.

The available files are:

  • predictions_cards.rds (forecasts)
  • score_cards_nation_cases.rds
  • score_cards_nation_deaths.rds
  • score_cards_state_cases.rds
  • score_cards_state_deaths.rds
  • score_cards_state_hospitalizations.rds
  • score_cards_nation_hospitalizations.rds
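
For example, in R you can download one of these files directly by its URL and read it in (a minimal sketch; any of the file names above can be substituted):

# Download a scores file from the bucket by URL and read it into R.
url <- "https://forecast-eval.s3.us-east-2.amazonaws.com/score_cards_nation_cases.rds"
tmp <- tempfile(fileext = ".rds")
download.file(url, tmp, mode = "wb")  # binary mode so the RDS file is not corrupted
nationCases <- readRDS(tmp)
head(nationCases)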

You can also connect to the AWS bucket and retrieve the data directly in R. For example, to retrieve the state cases file:

library(aws.s3)

# The bucket lives in the us-east-2 AWS region
Sys.setenv("AWS_DEFAULT_REGION" = "us-east-2")

# Connect to the public bucket; on failure the error object is returned
s3bucket <- tryCatch(
  {
    get_bucket(bucket = "forecast-eval")
  },
  error = function(e) {
    e
  }
)

# Read the state-level case scores directly from the bucket
stateCases <- tryCatch(
  {
    s3readRDS(object = "score_cards_state_cases.rds", bucket = s3bucket)
  },
  error = function(e) {
    e
  }
)

Forecasts with Actuals

If you are interested in the forecasts paired with their corresponding actual values (for example, to test different evaluation methods), they are available in the Amazon S3 bucket as three zip files. These files are static: they were generated with the aggregation script from the forecast and actual data available on June 12, 2023. The latest forecast date available for each target signal is

If the S3 bucket is down, these files are also available on Delphi’s file-hosting site.


Explanation of Scoring Methods

Weighted Interval Score

The weighted interval score (WIS) is a proper score that combines a set of prediction interval scores. As described in this article, it “can be interpreted as a generalization of the absolute error to probabilistic forecasts and allows for a decomposition into a measure of sharpness [spread] and penalties for over- and underprediction.” With certain weight settings, the WIS is an approximation of the continuous ranked probability score and can also be calculated in the form of an average pinball loss. A smaller WIS indicates better performance.
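
As a concrete illustration, here is a minimal sketch in R (not the dashboard's own scoring code) that computes WIS in the average-pinball-loss form mentioned above, given a forecast's quantile levels and values:

# Illustrative sketch: WIS as an average of quantile (pinball) scores.
wis <- function(quantile_levels, quantile_values, actual) {
  # Quantile score at level tau: 2 * (1{y <= q} - tau) * (q - y)
  qs <- 2 * ((actual <= quantile_values) - quantile_levels) * (quantile_values - actual)
  mean(qs)
}

# Example: the 7 quantile levels used for case forecasts, scored against an observed value of 120
q_levels <- c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)
q_values <- c(60, 80, 95, 110, 130, 150, 180)
wis(q_levels, q_values, 120)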

Spread

Spread is a component of the weighted interval score. It is a weighted average of the widths of the prediction intervals and does not depend on the ground truth. Spread is described in this article (note: that paper refers to spread as “sharpness”). A smaller spread score indicates narrower intervals. Models with narrower intervals imply a higher level of certainty in their forecasts, which may or may not be warranted.
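
Continuing the sketch above (again illustrative, not the dashboard's code), one way to compute spread is as the dispersion component of the WIS decomposition; the weights used here are an assumption on our part, following the article cited above:

# Illustrative sketch: spread (dispersion) as a weighted average of interval widths,
# where each alpha is 1 minus the interval's nominal level (e.g. 0.2 for the 80% interval).
spread <- function(alphas, lowers, uppers) {
  sum((alphas / 2) * (uppers - lowers)) / (length(alphas) + 0.5)
}

# Example: 50% and 80% central intervals from a single forecast
spread(alphas = c(0.5, 0.2), lowers = c(95, 80), uppers = c(130, 150))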

Absolute Error

The absolute error of a forecast is the absolute value of the difference between the actual value and the point forecast. If a model does not provide a point forecast explicitly, the 0.5 quantile (median) of the forecast distribution is used.

Coverage

Coverage is an estimate of the probability that a forecaster’s interval (at a certain nominal level such as 80%) correctly includes the actual value. It is estimated on a particular date by computing the proportion of locations for which a forecaster’s interval includes the actual value on that date. A perfectly calibrated forecaster would have each interval’s empirical coverage matching its nominal coverage. In the plot, this corresponds to being on the horizontal black line. Overconfidence corresponds to being below the line while underconfidence corresponds to being above the line.
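
For instance, the empirical coverage of an 80% interval on a given date can be sketched in R (illustrative only, not the dashboard's code) as the share of locations whose interval contains the actual value:

# Illustrative sketch: empirical coverage of an interval across locations on one date.
coverage <- function(lowers, uppers, actuals) {
  mean(actuals >= lowers & actuals <= uppers)
}

# Example: intervals for three locations, two of which contain the observed value
coverage(lowers = c(100, 50, 200), uppers = c(150, 90, 260), actuals = c(120, 95, 210))  # 0.667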



Note: All forecasts are evaluated against the latest version of observed data. Some forecasts may be scored against data that are quite different from what was observed when the forecast was made.


