Forecast Evaluation Dashboard

This dashboard was developed by:

  • Jed Grabman (Delphi Group, Google Fellow)
  • Kate Harwood (Delphi Group, Google Fellow)
  • Chris Scott (Delphi Group, Google Fellow)

with the Forecast Evaluation Research Collaborative:

  • Nicholas Reich (Reich Lab)
  • Jacob Bien (Delphi Group)
  • Logan Brooks (Delphi Group)
  • Estee Cramer (Reich Lab)
  • Daniel McDonald (Delphi Group)

Forecast data for all states and U.S. territories are supplied by the COVID-19 Forecast Hub.

Forecast Evaluation Research Collaborative

The Forecast Evaluation Research Collaborative was founded by the Reich Lab and the Delphi Group.

Both groups are funded by the CDC as Centers of Excellence for Influenza and COVID-19 Forecasting. We have partnered on this project to provide a robust set of tools and methods for evaluating the performance of epidemic forecasts.

The collaborative’s mission is to help epidemiological researchers gain insights into the performance of their forecasts and, ultimately, to enable more accurate forecasting of epidemics.

Both groups lead initiatives related to COVID-19 data and forecast curation. The Reich Lab created and maintains the COVID-19 Forecast Hub, a collaborative effort with over 80 groups submitting forecasts to be part of the official CDC COVID-19 ensemble forecast. The Delphi Group created and maintains COVIDcast, a platform for epidemiological surveillance data, and runs the Delphi Pandemic Survey via Facebook.

The Forecast Evaluation Dashboard is a collaborative project made possible by the 13 pro bono Google.org Fellows who have spent 6 months working full-time with the Delphi Group. Google.org is committed to the recovery of lives and communities that have been impacted by COVID-19 and to investing in the science needed to mitigate the damage of future pandemics.

About the Data

Sources

Observed values are from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

Forecaster predictions are drawn from the COVID-19 Forecast Hub GitHub repository.

Data for the dashboard is pulled from these sources on Mondays and Tuesdays.

Terms

  • Forecaster: A named model that produces forecasts, e.g., “COVIDhub-ensemble”

  • Forecast: A set of quantile predictions for a specific target variable, epidemiological week, and location

  • Target Variable: What the forecast is predicting, e.g., “weekly incident cases”

  • Epidemiological week (MMWR week): A standardized week that starts on a Sunday. See the CDC definition for additional details.

  • Horizon: The time between when a prediction was made and the end of the corresponding epidemiological week (a Saturday). Following the Reich Lab definition, a forecast has a horizon of 1 week if it was produced no later than the Monday of the epidemiological week it forecasts. Thus, forecasts made 5-11 days before the end of the corresponding epidemiological week have a horizon of 1 week, forecasts made 12-18 days before have a horizon of 2 weeks, and so on.
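The day-count rule in the Horizon definition can be sketched in a few lines of Python. This is only an illustration of the 5-11 / 12-18 day rule stated above, not the dashboard's own code; the function names are hypothetical, and raising an error for forecasts made after the Monday cutoff is an assumption of this sketch.

```python
from datetime import date, timedelta


def epiweek_end(d: date) -> date:
    """Return the Saturday that closes the epidemiological week containing d.

    Epidemiological weeks run Sunday through Saturday.
    """
    # date.weekday(): Monday=0, ..., Saturday=5, Sunday=6
    days_to_saturday = (5 - d.weekday()) % 7
    return d + timedelta(days=days_to_saturday)


def horizon_weeks(forecast_date: date, target_end_date: date) -> int:
    """Horizon in weeks: 5-11 days ahead -> 1, 12-18 days ahead -> 2, etc."""
    days_ahead = (target_end_date - forecast_date).days
    if days_ahead < 5:
        # Assumption of this sketch: forecasts made after the Monday of the
        # target week have no valid horizon for that week.
        raise ValueError("forecast was made after the Monday of the target week")
    return (days_ahead - 5) // 7 + 1
```

For example, a forecast made on Monday 2021-01-04 for the week ending Saturday 2021-01-09 is 5 days ahead and has a horizon of 1 week, while one made on Monday 2020-12-28 for the same week is 12 days ahead and has a horizon of 2 weeks.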

Dashboard Inclusion Criteria

A forecast is included only if all of the following criteria are met:

  • The target variable is the weekly incidence of either cases or deaths
  • The horizon is no more than 4 weeks ahead
  • The location is a U.S. state, territory, or the nation as a whole
  • All dates are parsable. If a date is not in yyyy/mm/dd format, the forecast may be dropped.
  • The forecast was made on or before the Monday of the relevant week. If multiple versions of a forecast are submitted, only the last version that meets the date restriction is included.

Notes on the Data

  • If a forecast does not include an explicit point estimate, the 0.5 quantile is taken as the point estimate for calculating absolute error.
  • WIS is only shown for forecasts that have predictions for all quantiles (23 quantiles for deaths and 15 for cases).
  • Totals over all states and territories do not include nationwide forecasts. To keep values comparable, these totals also exclude any location that is missing from a file submitted by any of the selected forecasters.
  • We include revisions of observed values, which means that the scores for forecasts made in the past can change as our understanding of the ground truth changes.
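The rule for totals over states and territories amounts to summing over the locations that every selected forecaster covers, with the national forecast left out. A minimal sketch follows, assuming each forecaster's locations are available as a list and that the national forecast carries the location code "US" (an assumption of this example). The function names are hypothetical and this is not the dashboard's own code.

```python
def common_locations(locations_by_forecaster, national_code="US"):
    """Locations present for every selected forecaster, excluding the nation.

    locations_by_forecaster maps a forecaster name to the locations it covers.
    national_code is an assumed identifier for the nationwide forecast.
    """
    sets = [set(locs) for locs in locations_by_forecaster.values()]
    common = set.intersection(*sets) if sets else set()
    common.discard(national_code)
    return common


def total_over(scores_by_location, locations):
    """Sum one forecaster's scores over the given locations only."""
    return sum(scores_by_location[loc] for loc in locations)
```

If forecaster A covers "US", "pa", "ny", and "ca" while forecaster B covers only "US", "pa", and "ny", both totals are taken over "pa" and "ny" alone, so the two numbers are comparable.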

Accessing the Data

The forecasts and scores are available as RDS files and are uploaded weekly to a publicly accessible AWS bucket.

To download a file from the bucket, append its filename to the URL https://forecast-eval.s3.us-east-2.amazonaws.com/. For instance, https://forecast-eval.s3.us-east-2.amazonaws.com/score_cards_nation_cases.rds downloads the scores for nation-level case predictions.

The available files are:

  • predictions_cards.rds (forecasts)
  • score_cards_nation_cases.rds
  • score_cards_nation_deaths.rds
  • score_cards_state_cases.rds
  • score_cards_state_deaths.rds
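As an illustration of the base-URL-plus-filename scheme, the sketch below builds a download URL and optionally saves the file locally using only the Python standard library. The helper names are hypothetical. The downloaded files are in R's RDS format, so they are read most directly with readRDS in R; reading them from Python would need a third-party reader, which is not shown here.

```python
from urllib.request import urlretrieve

BASE_URL = "https://forecast-eval.s3.us-east-2.amazonaws.com/"

# Filenames as listed above.
AVAILABLE_FILES = [
    "predictions_cards.rds",
    "score_cards_nation_cases.rds",
    "score_cards_nation_deaths.rds",
    "score_cards_state_cases.rds",
    "score_cards_state_deaths.rds",
]


def file_url(filename):
    """Build the download URL for one of the listed files."""
    if filename not in AVAILABLE_FILES:
        raise ValueError(f"unknown file: {filename}")
    return BASE_URL + filename


def download(filename, dest=None):
    """Download a listed file and return the local path it was saved to."""
    local_path, _headers = urlretrieve(file_url(filename), dest or filename)
    return local_path
```

Calling download("score_cards_state_deaths.rds") would save the state-level death scores in the current directory, provided the bucket is reachable.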

Explanation of Scoring Methods

Weighted Interval Score

The weighted interval score (WIS) is a proper score that combines a set of prediction interval scores. As described in this article, it “can be interpreted as a generalization of the absolute error to probabilistic forecasts and allows for a decomposition into a measure of sharpness [spread] and penalties for over- and underprediction.” With certain weight settings, the WIS is an approximation of the continuous ranked probability score, and can also be calculated in the form of an average pinball loss. A smaller WIS indicates better performance.
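To make the decomposition concrete, here is a minimal Python sketch of the WIS for a single forecast. It assumes the forecast consists of the median plus symmetric quantile pairs (levels tau and 1 - tau), weights each central interval by alpha/2, and splits the median's absolute error between the over- and underprediction terms. That last choice is one possible convention, and the function name is hypothetical; this is not the dashboard's implementation.

```python
def weighted_interval_score(levels, values, y):
    """Score one quantile forecast against the observed value y.

    levels: quantile levels, e.g. [0.25, 0.5, 0.75]
    values: predicted values at those levels
    Assumes an odd number of quantiles: the median plus symmetric pairs.
    """
    pairs = sorted(zip(levels, values))
    n = len(pairs)
    if n % 2 == 0:
        raise ValueError("expected the median plus symmetric quantile pairs")
    k = n // 2  # number of central prediction intervals
    median = pairs[k][1]

    spread = under = over = 0.0
    for i in range(k):
        tau, lower = pairs[i]
        _, upper = pairs[n - 1 - i]
        alpha = 2 * tau      # the interval has nominal coverage 1 - alpha
        weight = alpha / 2   # weight applied to this interval's score
        spread += weight * (upper - lower)
        # A miss costs (2 / alpha) * distance; multiplied by the weight
        # alpha / 2, that is just the distance to the violated bound.
        under += max(y - upper, 0.0)
        over += max(lower - y, 0.0)

    # Median term: weight 1/2 on the absolute error, split by direction.
    under += 0.5 * max(y - median, 0.0)
    over += 0.5 * max(median - y, 0.0)

    norm = k + 0.5
    parts = {
        "spread": spread / norm,
        "underprediction": under / norm,
        "overprediction": over / norm,
    }
    parts["wis"] = sum(parts.values())
    return parts
```

With quantiles 8, 10, and 12 at levels 0.25, 0.5, and 0.75 and an observed value of 15, the score equals twice the average pinball loss over the three quantiles. With the median alone, it reduces to the absolute error.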

Spread

Spread is a component of the weighted interval score. It is a weighted average of the widths of the prediction intervals and does not depend on the ground truth. Spread is described in this article (note: the paper refers to spread as “sharpness”). A smaller spread score indicates narrower intervals. Narrower intervals imply a higher level of certainty in the forecast, which may or may not be warranted.

Absolute Error

The absolute error of a forecast is the absolute value of the difference between the actual value and the point forecast. When a model does not provide a point forecast explicitly, the 50% quantile of the forecast distribution is used instead.
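A small Python sketch of this calculation, including the fallback to the 50% quantile, is given below. The function names and the representation of a forecast as a list of (level, value) pairs are assumptions made for illustration.

```python
def point_forecast(quantiles, point=None):
    """Return the explicit point forecast, or the 0.5 quantile if none is given.

    quantiles: list of (level, value) pairs.
    """
    if point is not None:
        return point
    for level, value in quantiles:
        if abs(level - 0.5) < 1e-9:
            return value
    raise ValueError("no explicit point forecast and no 0.5 quantile")


def absolute_error(actual, quantiles, point=None):
    """Absolute difference between the observed value and the point forecast."""
    return abs(actual - point_forecast(quantiles, point))
```

For an observed value of 120 and a forecast with a 0.5 quantile of 100, the absolute error is 20. If the same forecast also supplies an explicit point estimate of 110, the error is 10.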

Coverage

Coverage is an estimate of the probability that a forecaster's interval (at a certain nominal level such as 80%) correctly includes the actual value. It is estimated on a particular date by computing the proportion of locations for which a forecaster's interval includes the actual value on that date. A perfectly calibrated forecaster would have each interval's empirical coverage matching its nominal coverage. In the plot, this corresponds to being on the horizontal black line. Overconfidence corresponds to being below the line while underconfidence corresponds to being above the line.
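The estimate described above is a simple proportion, as the following Python sketch shows. It assumes the interval bounds and the observed values for one date are held in dictionaries keyed by location, and it treats a value on an interval boundary as covered. Both are assumptions of this example rather than documented behavior of the dashboard.

```python
def empirical_coverage(intervals, actuals):
    """Share of locations whose interval contains the observed value.

    intervals: dict mapping location -> (lower, upper) at one nominal level
    actuals:   dict mapping location -> observed value on the same date
    """
    shared = [loc for loc in intervals if loc in actuals]
    if not shared:
        raise ValueError("no locations with both an interval and an observation")
    hits = sum(
        1 for loc in shared
        if intervals[loc][0] <= actuals[loc] <= intervals[loc][1]
    )
    return hits / len(shared)
```

If three of four 80% intervals contain the observed value, the empirical coverage is 0.75. That falls below the nominal 0.8, so the forecaster would appear slightly overconfident on that date.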



Note: All forecasts are evaluated against the latest version of observed data. Some forecasts may be scored against data that are quite different from what was observed when the forecast was made.