Forecast Evaluation Dashboard
This dashboard was developed by:
 Chris Scott (Delphi Group, Google Fellow)
 Kate Harwood (Delphi Group, Google Fellow)
 Jed Grabman (Delphi Group, Google Fellow)
with the Forecast Evaluation Research Collaborative:
 Ryan Tibshirani (Delphi Group)
 Nicholas Reich (Reich Lab)
 Evan Ray (Reich Lab)
 Daniel McDonald (Delphi Group)
 Estee Cramer (Reich Lab)
 Logan Brooks (Delphi Group)
 Johannes Bracher (Karlsruhe Institute)
 Jacob Bien (Delphi Group)
Forecast data in all states and U.S. territories are supplied by:
Forecast Evaluation Research Collaborative
The Forecast Evaluation Research Collaborative was founded by:
 Carnegie Mellon University Delphi Group
 UMass-Amherst Reich Lab
Both groups are funded by the CDC as Centers of Excellence for Influenza and COVID-19 Forecasting. We have partnered on this project to provide a robust set of tools and methods for evaluating the performance of epidemic forecasts.
The collaborative’s mission is to help epidemiological researchers gain insight into the performance of their forecasts and, ultimately, to make epidemic forecasting more accurate.
Both groups lead initiatives related to COVID-19 data and forecast curation. The Reich Lab created and maintains the COVID-19 Forecast Hub, a collaborative effort with over 80 groups submitting forecasts to be part of the official CDC COVID-19 ensemble forecast. The Delphi Group created and maintains COVIDcast, a platform for epidemiological surveillance data, and runs the U.S. COVID-19 Trends and Impact Survey in partnership with Facebook.
The Forecast Evaluation Dashboard is a collaborative project made possible by the 13 pro bono Google.org Fellows who have spent 6 months working full-time with the Delphi Group. Google.org is committed to the recovery of lives and communities that have been impacted by COVID-19 and to investing in the science needed to mitigate the damage of future pandemics.
About the Data
Sources
Observed cases and deaths are from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
Observed hospitalizations are from the U.S. Department of Health & Human Services and are the sum of all adult and pediatric COVID-19 hospital admissions.
Forecaster predictions are drawn from the COVID-19 Forecast Hub GitHub repository.
Data for the dashboard is pulled from these sources on Sunday, Monday, and Tuesday each week.
Terms

Forecaster: A named model that produces forecasts, e.g., “COVIDhub-ensemble”

Forecast: A set of quantile predictions for a specific target variable, epidemiological week, and location

Target Variable: What the forecast is predicting, e.g., “weekly incident cases”

Epidemiological week (MMWR week): A standardized week that starts on a Sunday. See the CDC definition for additional details.

Horizon: The duration of time between when a prediction was made and the end of the corresponding epidemiological week. Following the Reich Lab definition, a forecast has a horizon of 1 week if it was produced no later than the Monday of the epidemiological week it forecasts. Thus, forecasts made 5 to 11 days before the end of the corresponding epidemiological week have a horizon of 1 week, forecasts made 12 to 18 days before have a horizon of 2 weeks, and so on.
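The horizon rule above can be sketched in R. This is only an illustration: the function name `weekly_horizon` and its arguments are hypothetical, and it assumes the epidemiological week ends on a Saturday (a week starting on Sunday).

```r
# Hypothetical helper: horizon in weeks for a forecast of one epidemiological week.
# `forecast_date` is when the forecast was made; `target_end_date` is the
# Saturday that ends the forecasted epidemiological week (assumed).
weekly_horizon <- function(forecast_date, target_end_date) {
  days_before <- as.integer(as.Date(target_end_date) - as.Date(forecast_date))
  # 5 to 11 days before the end -> 1 week, 12 to 18 days -> 2 weeks, and so on.
  horizon <- (days_before - 5L) %/% 7L + 1L
  # Fewer than 5 days before the end means the Monday cutoff was missed.
  horizon[days_before < 5L] <- NA_integer_
  horizon
}

weekly_horizon("2021-03-01", "2021-03-06")  # made on the Monday of the target week
weekly_horizon("2021-02-23", "2021-03-06")  # made 11 days before the week ends
weekly_horizon("2021-02-22", "2021-03-06")  # made 12 days before the week ends
```

The first two calls fall in the 1-week bucket and the third falls in the 2-week bucket; a forecast made after the Monday returns `NA`.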
Dashboard Inclusion Criteria
A forecast is only included if all the following criteria are met:
 The target variable is the weekly incidence of either cases or deaths, or the daily incidence of hospitalizations
 The horizon is no more than 4 weeks ahead
 The location is a U.S. state, territory, or the nation as a whole
 All dates are parsable. If a date is not in yyyy/mm/dd format, the forecast may be dropped.
 The forecast was made on or before the Monday of the relevant week. If multiple versions of a forecast are submitted then only the last forecast that meets the date restriction is included.
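As a rough sketch of the last criterion, the following R snippet keeps, for each forecaster, only the latest version submitted on or before the Monday cutoff. The data frame, its column names, and the forecaster names are made up for illustration and are not the dashboard's actual data structures.

```r
# Hypothetical submissions for one target week; column names are illustrative.
submissions <- data.frame(
  forecaster = c("model-a", "model-a", "model-a", "model-b"),
  forecast_date = as.Date(c("2021-02-27", "2021-03-01", "2021-03-02", "2021-02-28")),
  stringsAsFactors = FALSE
)
monday_cutoff <- as.Date("2021-03-01")  # Monday of the forecasted week

# Keep only versions made on or before the Monday cutoff.
eligible <- submissions[submissions$forecast_date <= monday_cutoff, ]

# Sort newest first, then keep the first (latest) row per forecaster.
eligible <- eligible[order(eligible$forecast_date, decreasing = TRUE), ]
included <- eligible[!duplicated(eligible$forecaster), ]
print(included)
```

Here the 2021-03-02 version from "model-a" is dropped because it was made after the Monday, so its 2021-03-01 version is the one retained.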
How Hospitalization Forecasts are Processed
Although hospitalizations are forecast on a daily basis, the dashboard shows hospitalization scores on a weekly basis, in keeping with how cases and deaths are scored and plotted. We only look at forecasts for one target day a week (currently Wednesdays) and calculate the weekly horizons accordingly. Hospitalization horizons are calculated in the following manner:
 2 days ahead: Forecast date is on or before the Monday preceding the target date (Wednesday)
 9 days ahead: Forecast date is on or before 7 days before the Monday preceding the target date
 16 days ahead: Forecast date is on or before 14 days before the Monday preceding the target date
 23 days ahead: Forecast date is on or before 21 days before the Monday preceding the target date
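The cutoffs above can be sketched as a small R function. The name `hosp_days_ahead` is hypothetical. The sketch also makes two assumptions that the list does not state: a forecast is assigned to the longest horizon whose cutoff it meets, and forecasts made 28 or more days before the Monday are not assigned to any horizon.

```r
# Hypothetical helper: map a forecast date to one of the 2/9/16/23-day horizons
# for a Wednesday target date.
hosp_days_ahead <- function(forecast_date, target_date) {
  monday <- as.Date(target_date) - 2L  # Monday preceding the Wednesday target
  gap <- as.integer(monday - as.Date(forecast_date))  # days before that Monday
  # Each full week before the Monday adds 7 days to the horizon.
  days_ahead <- 2L + 7L * (gap %/% 7L)
  # Assumption: forecasts made after the Monday, or beyond the 23-day bucket,
  # receive no horizon.
  days_ahead[gap < 0L | days_ahead > 23L] <- NA_integer_
  days_ahead
}

target <- "2021-03-10"  # a Wednesday
hosp_days_ahead(c("2021-03-08", "2021-03-01", "2021-02-22", "2021-02-15"), target)
```

The four forecast dates in the usage line are successive Mondays, so they map to the 2-, 9-, 16-, and 23-day horizons respectively.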
Notes on the Data
 If a forecast does not include an explicit point estimate, the 0.5 quantile is taken as the point estimate for calculating absolute error.
 The weighted interval score is only shown for forecasts that have predictions for all quantiles (23 quantiles for deaths and hospitalizations and 7 for cases)
 Totaling over all states and territories does not include nationwide forecasts. To ensure that values are comparable, these totals also exclude any locations that are absent from any file that was submitted by one of the selected forecasters.
 For scoring, we include revisions of observed values, which means that the scores for forecasts made in the past can change as our understanding of the ground truth changes.
 The observed data can also be viewed ‘as of’ a certain date, which shows what observed data a forecaster had available when a past forecast was made (but the forecasts are always scored on the latest revision of the observed data).
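The note on totaling over states and territories can be illustrated with a simplified R sketch. The data frame, column names, and values below are invented, and the sketch compares locations per forecaster rather than per submitted file, so it is an approximation of the rule rather than the dashboard's own code.

```r
# Hypothetical scores; "us" stands for the nationwide forecast.
scores <- data.frame(
  forecaster = c("model-a", "model-a", "model-a", "model-b", "model-b"),
  location = c("pa", "ny", "us", "pa", "us"),
  wis = c(10, 20, 100, 12, 90),
  stringsAsFactors = FALSE
)

# Nationwide forecasts are not part of the state and territory total.
state_scores <- scores[scores$location != "us", ]

# Keep only locations that every selected forecaster covered.
locations_by_forecaster <- split(state_scores$location, state_scores$forecaster)
common_locations <- Reduce(intersect, locations_by_forecaster)
comparable <- state_scores[state_scores$location %in% common_locations, ]

# Totals are now computed over the same set of locations for each forecaster.
totals <- tapply(comparable$wis, comparable$forecaster, sum)
print(totals)
```

In this toy data only "pa" is covered by both forecasters, so "ny" is excluded from the total for "model-a" to keep the two totals comparable.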
Accessing the Data
The forecasts and scores are available as RDS files and are uploaded weekly to a publicly accessible AWS bucket.
You can use the URL https://forecast-eval.s3.us-east-2.amazonaws.com/ + filename to download any of the files from the bucket.
For instance, use https://forecast-eval.s3.us-east-2.amazonaws.com/score_cards_nation_cases.rds to download scores for nation-level case predictions.
The available files are:
 predictions_cards.rds (forecasts)
 score_cards_nation_cases.rds
 score_cards_nation_deaths.rds
 score_cards_state_cases.rds
 score_cards_state_deaths.rds
 score_cards_state_hospitalizations.rds
 score_cards_nation_hospitalizations.rds
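For a plain download without AWS credentials, a minimal base R sketch might look like the following. The helper names `score_file_url` and `download_score_file` are hypothetical, and the base URL assumes the bucket address is https://forecast-eval.s3.us-east-2.amazonaws.com/ (with hyphens in the bucket and region names).

```r
# Assumed public base URL of the bucket.
base_url <- "https://forecast-eval.s3.us-east-2.amazonaws.com/"

# Build the download URL for one of the files listed above.
score_file_url <- function(filename) {
  paste0(base_url, filename)
}

# Download a file to a temporary location and read it into R.
download_score_file <- function(filename) {
  dest <- tempfile(fileext = ".rds")
  # mode = "wb" keeps the binary RDS file intact on Windows.
  download.file(score_file_url(filename), destfile = dest, mode = "wb")
  readRDS(dest)
}

# Example call (requires network access):
# nation_cases <- download_score_file("score_cards_nation_cases.rds")
```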
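You can also connect to AWS and retrieve the data in R. The following example retrieves the state cases file:
library(aws.s3)

# The bucket is hosted in the us-east-2 region
Sys.setenv("AWS_DEFAULT_REGION" = "us-east-2")

# Look up the bucket; if the request fails, the error object is returned instead
s3bucket <- tryCatch(
  {
    get_bucket(bucket = "forecast-eval")
  },
  error = function(e) {
    e
  }
)

# Read the state-level case scores; check the result for an error before using it
stateCases <- tryCatch(
  {
    s3readRDS(object = "score_cards_state_cases.rds", bucket = s3bucket)
  },
  error = function(e) {
    e
  }
)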
Forecasts with actuals
If you want the forecasts paired with the corresponding actual values (for example, to test different evaluation methods), they are available in the Amazon S3 bucket as 3 zip files. These files are static; they were generated with the aggregation script from the forecast and actual data available on June 12, 2023. The latest forecast date available for each target signal is:
 cases: 2023-02-13
 hospitalizations:
  1 week: 2023-06-05
  2 week: 2023-06-05
  3 week: 2023-06-05
  4 week: 2023-06-05
 deaths: 2023-03-06
If the S3 bucket is down, these files are also available on Delphi’s file-hosting site.
Explanation of Scoring Methods
Weighted Interval Score
The weighted interval score (WIS) is a proper score that combines a set of prediction interval scores. As described in this article, it “can be interpreted as a generalization of the absolute error to probabilistic forecasts and allows for a decomposition into a measure of sharpness [spread] and penalties for over- and underprediction.” With certain weight settings, the WIS is an approximation of the continuous ranked probability score, and can also be calculated in the form of an average pinball loss. A smaller WIS indicates better performance.
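To make the definition concrete, here is a minimal R sketch of the WIS for a single forecast. It assumes the commonly used weights (1/2 for the median term and alpha/2 for each central interval) and a symmetric set of quantile levels that includes 0.5. The function names and the example numbers are illustrative, not the dashboard's scoring code.

```r
# WIS from central prediction intervals, assuming weights w0 = 1/2, wk = alpha_k / 2.
weighted_interval_score <- function(levels, values, actual) {
  ord <- order(levels)
  levels <- levels[ord]
  values <- values[ord]
  n <- length(levels)
  k <- (n - 1) / 2  # number of central intervals
  stopifnot(n %% 2 == 1, isTRUE(all.equal(levels[k + 1], 0.5)))
  total <- 0.5 * abs(actual - values[k + 1])  # median term
  for (i in seq_len(k)) {
    alpha <- 2 * levels[i]  # interval with nominal coverage 1 - alpha
    lower <- values[i]
    upper <- values[n + 1 - i]
    interval_score <- (upper - lower) +
      (2 / alpha) * max(lower - actual, 0) +
      (2 / alpha) * max(actual - upper, 0)
    total <- total + (alpha / 2) * interval_score
  }
  total / (k + 0.5)
}

# The same score written as twice the average pinball (quantile) loss.
mean_pinball_loss <- function(levels, values, actual) {
  loss <- ifelse(actual >= values,
                 levels * (actual - values),
                 (1 - levels) * (values - actual))
  mean(loss)
}

levels <- c(0.1, 0.25, 0.5, 0.75, 0.9)
values <- c(80, 90, 100, 110, 120)
weighted_interval_score(levels, values, actual = 130)  # 21.6
2 * mean_pinball_loss(levels, values, actual = 130)    # 21.6
```

With these weights the two forms agree, which is the sense in which the WIS can be computed as an average pinball loss.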
Spread
Spread is a component of the weighted interval score. It is a weighted average of the widths of the prediction intervals and does not depend on the ground truth. Spread is described in this article (note: in this paper spread is defined as “sharpness”). A smaller spread score indicates narrower intervals. Models that have narrower intervals are implying a higher level of certainty in their forecast that may or may not be warranted.
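A short R sketch of the spread component follows, under the same assumed weights as the usual WIS definition (alpha/2 per central interval, normalized by the number of intervals plus one half). The function name and example quantiles are hypothetical.

```r
# Spread: weighted interval widths; the observed value is not an input.
interval_spread <- function(levels, values) {
  ord <- order(levels)
  levels <- levels[ord]
  values <- values[ord]
  n <- length(levels)
  k <- (n - 1) / 2  # number of central intervals around the median
  idx <- seq_len(k)
  alpha <- 2 * levels[idx]                     # 1 - nominal coverage
  widths <- values[n + 1 - idx] - values[idx]  # upper minus lower bound
  sum((alpha / 2) * widths) / (k + 0.5)
}

levels <- c(0.1, 0.25, 0.5, 0.75, 0.9)
narrow <- c(80, 90, 100, 110, 120)
wide <- c(60, 80, 100, 120, 140)
interval_spread(levels, narrow)  # 3.6
interval_spread(levels, wide)    # 7.2
```

Doubling every interval width doubles the spread, regardless of where the actual value ends up.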
Absolute Error
The absolute error of a forecast is the absolute value of the difference between the actual value and the point forecast. When a model does not provide a point forecast explicitly, the point forecast is taken to be the 50% quantile of the forecast distribution.
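The fallback to the median can be sketched in R as follows. The function name `absolute_error` and its arguments are hypothetical; `point` stands for an explicitly submitted point forecast, if any.

```r
# Absolute error, using the 0.5 quantile when no explicit point forecast is given.
absolute_error <- function(actual, levels, values, point = NA_real_) {
  if (is.na(point)) {
    is_median <- abs(levels - 0.5) < 1e-8
    stopifnot(sum(is_median) == 1)  # the forecast must contain the 0.5 quantile
    point <- values[is_median]
  }
  abs(actual - point)
}

levels <- c(0.1, 0.25, 0.5, 0.75, 0.9)
values <- c(80, 90, 100, 110, 120)
absolute_error(130, levels, values)               # median used as point: 30
absolute_error(130, levels, values, point = 105)  # explicit point forecast: 25
```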
Coverage
Coverage is an estimate of the probability that a forecaster’s interval (at a certain nominal level such as 80%) correctly includes the actual value. It is estimated on a particular date by computing the proportion of locations for which a forecaster’s interval includes the actual value on that date. A perfectly calibrated forecaster would have each interval’s empirical coverage matching its nominal coverage. In the plot, this corresponds to being on the horizontal black line. Overconfidence corresponds to being below the line while underconfidence corresponds to being above the line.
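As an illustration of the estimate described above, this R sketch computes empirical coverage for one forecaster on one date across several locations. The function name and the numbers are made up; the bounds are assumed to be the forecaster's 80% interval at each location.

```r
# Proportion of locations whose interval contains the actual value.
empirical_coverage <- function(lower, upper, actual) {
  mean(actual >= lower & actual <= upper)
}

# Hypothetical 80% intervals and actual values for four locations on one date.
lower <- c(80, 40, 10, 200)
upper <- c(120, 60, 30, 260)
actual <- c(130, 55, 12, 210)
empirical_coverage(lower, upper, actual)  # 0.75
```

Here three of the four intervals contain the actual value, so the empirical coverage of 0.75 falls slightly below the nominal 0.80, which on the plot would sit just under the line.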
Note: All forecasts are evaluated against the latest version of observed data. Some forecasts may be scored against data that are quite different from what was observed when the forecast was made.