A Syndromic COVID-19 Indicator Based on Insurance Claims of Outpatient Visits

#medical records #COVIDcast #R
A Syndromic COVID-19 Indicator Based on Insurance Claims of Outpatient Visits

Aaron Rumack and Roni Rosenfeld


    Our COVIDcast map and API feature several novel early indicators of COVID-19 activity. In past posts, we discussed our large-scale daily surveys that, as of October 2020, have reached over 12 million people throughout the US, in partnership with Facebook and Google. In another ongoing data initiative, health system partners grant us access to various aggregate statistics from hospital records and insurance claims covering 10-15% of the United States population. From these data, we can extract informative indicators that can be early indicators of COVID activity. Early indicators are important because they help policymakers make more informed decisions and can also improve epidemiological forecasts.

    One indicator that we created from the outpatient insurance claims portion of this data is what we call the Doctor Visits or DV indicator, which estimates the percentage of outpatient visits (including telemedicine, urgent care, and emergency department visits) that are due to COVID-Like Illness or CLI. We will use % CLI-in-DV to abbreviate the percentage of outpatient visits due to CLI, the units of our DV indicator. Below, we explain how we calculate the DV indicator and discuss its relation to confirmed COVID-19 cases. We also discuss our observation that the DV indicator possesses significant spatial heterogeneity (it displays systematic county-to-county differences), and we describe a simple dynamic adjustment to address this issue.


    Electronic medical records (EMR) are instrumental in providing epidemiological information in near real-time. Sick patients interact with the healthcare system up to several weeks before they are confirmed as cases and added to the published count (see below). Thus signals based on EMR data can be early indicators, preceding official reports of confirmed cases and deaths. Early knowledge of a rise in COVID-like symptoms can enable early containment efforts to manage emerging outbreaks.

    The value of EMR data becomes apparent when you compare them to official COVID-19 case counts:

    • Limited testing capacity. Especially prevalent in the earlier days of the pandemic, limited testing capacity means that many people saw a doctor for COVID-like symptoms but did not receive a test, even though they should have. These people would be represented in our DV indicator but absent from the confirmed counts.

    • Reporting delays. From the moment a patient is first examined (or remotely evaluated), it may take several days until their specimen is collected, and another few days after testing before results are available. When tests come back positive, they must (by law) be reported to the public health authorities. The health authorities typically do further verification and investigation (for example, determining age and other demographics) before adding the cases to their public website. As a result, reports of confirmed cases trail behind the medical evaluation date by several weeks. Even with the reporting delays of insurance claims (see below), a claims-based signal allows estimates of disease activity within 3-4 days of when the outpatient visits occurred.

    • Date accuracy. Sites like JHU CSSE and USAFacts scrape and report cumulative confirmed cases based on the cases’ first date of public posting. As a result, the “number of new daily counts” (derived by successive subtraction from these reports) often refers to the number of cases published instead of those lab-confirmed on that date. In comparison, insurance claims indicate a “date of service”, which is typically closer to the testing date. Accurate dating of cases is critical for modeling, analysis, and forecasting.

    • Retrospective revisions. Some states have changed their definition of confirmed COVID-19 cases. This is sometimes done retroactively, adding hundreds or thousands of past cases and reporting them as “new” on the day that the definition was changed, resulting in an artificial spike on that date, with no way to assign these cases to their correct dates. This degrades the quality of the case time series.

    The Doctor Visits Indicator

    The Doctor Visits indicator is based solely on insurance claims. We count the outpatient claims for a given geographic area and day that fall into each of five categories:

    1. Total: All claims, whether or not related to COVID-19 (also known as all-cause claims).

    2. Flu: ICD-10 primary code starting with J09, J10, or J11, a definitive diagnosis of influenza.

    3. COVID-like: ICD-10 primary code in {U07.1, U07.2, B97.29, J12.81, Z03.818, B34.2, J12.89}. Some of these codes correspond to a definitive diagnosis of COVID-19, while others indicate strongly related conditions.

    4. Flu-like: ICD-10 primary code of J22 (acute lower respiratory infection) or B34.9 (viral infection, unspecified).

    5. Mixed: ICD-10 primary code of Z20.828 (suspected exposure to COVID-19) or J12.9 (viral pneumonia).

    We estimate the percent of COVID-like illness in doctor visits (% CLI-in-DV) as 100 * (COVID-like + Flu-like + Mixed - Flu) / Total. We subtract the “Flu” count because the higher the presence of confirmed flu, the larger the fraction of flu-like cases that are due to flu rather than to COVID. See our signal documentation site for more details.

    The indicator is available daily starting February 1, 2020 (although understandably, it is nearly zero in all locations until mid-March). To preserve privacy and data integrity, we do not report the indicator for a given location and date if there are fewer than 500 total visits in the seven days ending on that date. Following these restrictions, each day, we are able to produce estimates for about 2000 counties (roughly two-thirds of the counties in the US), accounting for over 90% of the country’s population.

    Below, we plot two maps to explore the indicator. On the left is a heatmap representing the DV indicator for each state, averaged over April 15 to May 15. On the right is a heatmap of daily confirmed COVID-19 cases per 100,000 people using data from USAFacts, averaged over the same time span. We can see that the states with many cases per capita tend to have high % CLI-in-DV, and states with fewer cases tend to have lower % CLI-in-DV, providing a good sanity check.

    options(covidcast.auth = Sys.getenv("API_KEY")) # for more on API keys, see: https://cmu-delphi.github.io/delphi-epidata/api/api_keys.html
    # Fetch DV % CLI signal and USAFacts confirmed case incidence proportion at
    # the state level
    start_day = "2020-04-15"
    end_day = "2020-05-15"
    df_dv = covidcast_signal("doctor-visits", "smoothed_adj_cli",
                             start_day, end_day, geo_type = "state")
    df_in = covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop",
                             start_day, end_day, geo_type = "state")
    # For each state, average the signals over all available times
    df_dv_avg = df_dv %>% group_by(geo_value) %>% summarize(value = mean(value))
    df_in_avg = df_in %>% group_by(geo_value) %>% summarize(value = mean(value))
    # Use as.covidcast_signal to turn these into covidcast data frames
    df_dv_avg <- df_dv_avg %>%
      mutate(time_value = start_day) %>%
      as.covidcast_signal(signal = "dv_avg", geo_type = "state")
    df_in_avg <- df_in_avg %>%
      mutate(time_value = start_day) %>%
      as.covidcast_signal(signal = "in_avg", geo_type = "state")
    # Plot choropleth maps, using the covidcast plotting functionality
    subtitle = paste("Averaged over", start_day, "to", end_day)
    p1 = plot(df_dv_avg,
              title = "% COVID-like illness in outpatient visits",
              range = c(0, 15), choro_params = list(subtitle = subtitle))
    p2 = plot(df_in_avg,
              title = "Daily new confirmed COVID-19 cases per 100,000 people",
              range = c(0, 25), choro_params = list(subtitle = subtitle))
    grid.arrange(p1, p2, nrow = 1)


    Medical insurance claim records introduce a few interesting challenges not present in, for example, survey data. First, insurance claims are available at a significant and variable latency, often referred to as “backfill”. Due to typically very heavy backfill that occurs right after the date of service, we do not even release the indicator for a given date until four days have passed. We continue to revise the % CLI-in-DV estimate every day thereafter based on newly received claims. For a given date of service, we typically receive only about half of the total claims within seven days following that date.

    Interestingly, certain types of claims have longer latency than others. We found that for a given date, as more claims become available, % CLI-in-DV usually (but not always) increases. COVID-related claims tend to have longer latency and thus our early estimates of the indicator usually (but not always) underestimate the final value.

    Weekday Effects

    Another challenge is the influence of the day of the week on the DV indicator. On weekends, both total counts and COVID-like counts decrease, but proportionally, total counts decrease more. This is because doctor visits during the weekend tend to focus on acute care. The total counts include many visits related to non-acute issues, but almost all COVID-like counts are due to acute issues. Without adjusting for this weekday effect, the DV indicator has a “sawtooth” pattern, spiking on weekends. We derived a method to create an adjusted indicator that accounts for this weekday effect (for a precise description, see our signal documentation). Below, we visualize the effect of making these adjustments. When we do not adjust for the weekday effect, we see a sawtooth pattern that clearly does not represent true changes in COVID-like illness within a location. However, after making the weekday adjustment, we get a smooth curve that looks reasonable. It is important to note that this adjustment is not temporal smoothing! Rather, we are making an adjustment each day based on historical patterns of weekday-to-weekend differences.

    start_day = "2020-05-01"
    end_day = "2020-08-01"
    data_unadjusted = covidcast_signal("doctor-visits", "smoothed_cli",
                                       start_day = start_day, end_day = end_day,
                                       geo_type = "state")
    data_adjusted = covidcast_signal("doctor-visits", "smoothed_adj_cli",
                                     start_day = start_day, end_day = end_day,
                                     geo_type = "state")
    data_unadjusted = data_unadjusted %>%
      select(geo_value, signal, time_value, value) %>%
      tidyr::pivot_wider(names_from = signal, values_from = value)
    data_adjusted = data_adjusted %>%
      select(geo_value, signal, time_value, value) %>%
      tidyr::pivot_wider(names_from = signal, values_from = value)
    cmb_df = inner_join(data_unadjusted, data_adjusted)
    cmb_df$geo_value = toupper(cmb_df$geo_value)
    states_to_plot = c("AZ", "CA", "GA", "FL", "NJ", "NY", "PA", "TX", "WI")
    ggplot_colors = c("#FC4E07", "#00AFBB") # Red, blue similar to ggplot defaults
    ggplot(cmb_df %>% filter(geo_value %in% states_to_plot)) +
      geom_line(aes(x = time_value, y = smoothed_cli, color = "Original")) +
      geom_line(aes(x = time_value, y = smoothed_adj_cli, color = "Adjusted")) +
      scale_color_manual(values = ggplot_colors[2:1]) +
      facet_wrap(vars(geo_value)) +
      labs(x = "Date", y = "% CLI-in-DV",
           title = "DV indicator, with and without weekday adjustment") +
      theme_bw() +
      theme(legend.position = "bottom", legend.title = element_blank())

    Basic Correlation Analyses

    As a simple quantitative analysis, we can measure the correlation between the DV indicator and daily COVID-19 case rates. The hope is that when the DV indicator is high, the daily case rate (new daily case count per 100,000 people) is also high. We can measure this correlation across two axes: space and time. Everywhere below, by “correlation” we actually mean Spearman correlation (correlation between ranks); in addition, in all experiments (unless otherwise noted) we restrict our attention to counties with at least 500 cumulative COVID-19 cases. First, we consider correlations sliced by space (correlations-by-space, for short): this means, for each county, we correlate the vector of % CLI-in-DV values over time with that of COVID-19 case rates. The distribution of these correlations, plotted below, is concentrated around 0.75. This indicates that, separately for many counties, days where % CLI-in-DV is high tend to coincide with days where case rates are also high.

    start_day = "2020-04-15"
    end_day = "2020-10-01"
    df_adjusted = covidcast_signal("doctor-visits", "smoothed_adj_cli",
                                   start_day, end_day)
    df_cases = covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop",
                                start_day, end_day)
    case_num = 500
    cumulative_case_df = covidcast_signal("usa-facts", "confirmed_cumulative_num",
    geo_values = cumulative_case_df %>%
      filter(value >= case_num) %>% pull(geo_value)
    dv_cases_df = bind_rows(df_adjusted, df_cases) %>%
      filter(geo_value %in% geo_values) %>%
      select(geo_value, signal, time_value, value) %>%
      tidyr::pivot_wider(names_from = signal, values_from = value) %>%
      rename(cases = confirmed_7dav_incidence_prop, dv = smoothed_adj_cli) %>%
      filter(purrr::map_lgl(geo_value, function(fips) {
        substr(fips, 3, 5) != "000"})) %>%
      group_by(time_value) %>%
    df_cor_by_time =
      covidcast_cor(df_adjusted %>%
                      filter(geo_value %in% geo_values),
                    df_cases %>%
                      filter(geo_value %in% geo_values),
                    by = "time_value", method = "spearman")
    df_cor_by_space =
      covidcast_cor(df_adjusted %>%
                      filter(geo_value %in% geo_values),
                    df_cases %>%
                      filter(geo_value %in% geo_values),
                    by = "geo_value", method = "spearman")
    ggplot(df_cor_by_space) +
      geom_density(aes(x = value), fill = "gray") +
      labs(title = "Correlation-by-space between DV indicator and case rates",
           subtitle = "Over all counties with at least 500 cumulative cases",
           x = "Correlation", y = "Density") +

    Locations with small counts usually have a lower signal-to-noise ratio, which means that their correlations will usually be lower. In the next plot, we show that the average correlation (still sliced by space) is higher when we restrict our attention to counties with higher cumulative case counts. The average correlation increases substantially when considering only counties that have more than 2,000 cumulative COVID-19 cases, and increasing the threshold from there results in only a modest increase in correlation.

    df_cor_by_space = df_cor_by_space %>%
      inner_join(cumulative_case_df %>%
                   select(geo_value, value) %>%
                   rename(cases = value))
    tibble(thresholds = seq(500, 20000, by = 500)) %>%
      mutate(avg_corr = purrr::map_dbl(
        thresholds, function(t) {
          mean(df_cor_by_space %>% filter(cases >= t) %>% pull(value), na.rm=T)
        })) %>%
      ggplot() +
      geom_line(aes(x = thresholds, y = avg_corr)) +
      labs(x = "Cumulative cases threshold", y = "Correlation",
           title = "Mean correlation-by-space between DV indicator and case rates",
           subtitle = "Over all counties with cumulative cases above threshold") +

    Now, we consider correlations sliced by time (correlations-by-time, for short): this means, for each day, we correlate the vector of % CLI-in-DV values over counties with that of COVID-19 case rates. The correlation remains somewhat steady into the middle of July, but starting at the end of July, the correlation starts to decrease. By late August, the correlation is essentially zero. In other words, on a given day in April or May, a county with a higher % CLI-in-DV was more likely to have a higher case rate than a county with a lower % CLI-in-DV. But by August and September, this was no longer true.

    ggplot(df_cor_by_time) +
      geom_line(aes(x = time_value, y = value)) +
      labs(title = "Correlation-by-time between DV indicator and case rates",
           subtitle = "Over all counties with at least 500 cumulative cases",
           x = "Date", y = "Correlation") +

    Spatial Heterogeneity

    This correlation deterioration was surprising to us, so we set out to find out the reason. After testing many hypotheses, we found one cause: some areas of the country have a higher % CLI-in-DV “baseline” than others. We illustrate this with the plots below, where we plot the average case rate and average DV indicator for each of the 10 HHS regions (each region is comprised of 2-7 states; you can find the details on the Department of HHS website).

    fips_to_hhs = function(fips) {
      st = as.integer(substr(fips,1,2))
      if (st %in% c(9,23,25,33,44,50)) {
      } else if (st %in% c(34,36,72,78)) {
      } else if (st %in% c(10,11,24,42,51,54)) {
      } else if (st %in% c(1,12,13,21,28,37,45,47)) {
      } else if (st %in% c(17,18,26,27,39,55)) {
      } else if (st %in% c(5,22,35,40,48)) {
      } else if (st %in% c(19,20,29,31)) {
      } else if (st %in% c(8,30,38,46,49,56)) {
      } else if (st %in% c(4,6,15,32,60,66,69)) {
      } else if (st %in% c(2,16,41,53)) {
      } else {
    dv_cases_df = dv_cases_df %>%
      mutate(hhs = purrr::map_dbl(geo_value, fips_to_hhs))
    dv_cases_df = dv_cases_df %>%
      left_join(county_census %>%
                  select(FIPS, POPESTIMATE2019) %>%
                  rename(geo_value = FIPS, pop = POPESTIMATE2019))
    p1 = dv_cases_df %>%
      group_by(time_value, hhs) %>%
      summarize(dv = sum(dv * pop) / sum(pop), cases = sum(cases * pop) / sum(pop)) %>%
      ungroup() %>%
      ggplot() +
      geom_line(aes(x = time_value, y = cases, color = as.factor(hhs))) +
      labs(title = "Mean case rate per HHS region",
           x = "Date", y = "New cases per 100,000 people", color = "HHS") +
    p2 = dv_cases_df %>%
      group_by(time_value, hhs) %>%
      summarize(dv = sum(dv * pop / sum(pop), na.rm = TRUE),
                cases = sum(cases * pop / sum(pop), na.rm = TRUE)) %>%
      ungroup() %>%
      ggplot() +
      geom_line(aes(x = time_value, y = dv, color = as.factor(hhs))) +
      labs(title = "Mean DV indicator per HHS region",
           x = "Date", y = "% CLI-in-DV", color = "HHS") +
    grid.arrange(p1, p2, nrow = 1)

    Let’s look more closely at the behavior of our indicator in HHS 2 (New Jersey and New York). We will compare the two curves in two ways: when the case rate is high (or low) in HHS 2 compared to other days, is the DV indicator also high (or low)? And when the case rate is high (or low) in HHS 2 compared to other HHS regions, is the DV indicator also high (or low) in HHS 2? The first question asks how high the correlation-by-space is, and the second asks how high the correlation-by-time is.

    The answer to the first question is largely yes. The case rate curve is steadily decreasing throughout April and May, flattening out by mid-June. The DV indicator peaks a little later than case rates do, but decreases throughout May and is mostly flat by mid-June. This tells us that the correlation-by-space is good in HHS 2. However, the answer to the second question reveals an issue with the DV indicator. The case rate in HHS 2 is the highest of all HHS regions in May and into June, but by July, HHS 2 is among the lowest in case rate. This is not true with the DV indicator. HHS 2 has the highest % CLI-in-DV almost throughout the entire time period, even when its case rate is one of the lowest. So we see why the correlation-by-time began to decline starting in August: counties in HHS 2 tended to have low case rates but high % CLI-in-DV values, driving down the correlations.

    We have now identified the problem: a % CLI-in-DV of, say, 5% might mean a very low case rate for a county in New York but a very high case rate for a county in Oregon. Since we have six months’ worth of history for both the DV indicator and case rates, we can correct for this problem by regressing case rates on the DV indicator in a location-specific manner. We call this sensorization, that is, the process of turning the DV indicator into a sensor. In the past, the Delphi group has done extensive work in creating sensors from multiple data sources for tracking flu, dengue, and norovirus; and developed sensor fusion methodology for combining multiple sensors into a single unified estimate. For more, see Chapter 4 of Farrow 2016 and Jahja et al. 2019.

    In the present example, we sensorize by fitting a simple linear regression model of case rates on % CLI-in-DV, separately for each day and location, training on data from the past six weeks. Because we fit the regression anew each day based on the most recently available data, the method adapts to (potential) gradual changes in the underlying relationship between % CLI-in-DV and case rate. This method is mostly real-time, but for backfill in the DV indicator, as discussed above.

    dv_cases_df = dv_cases_df %>%
      group_by(geo_value) %>%
      mutate(coef_6wk =
                          function(d) {
                            mask = time_value %in% (d-seq(7,41))
                            tryCatch(lm(cases[mask] ~ dv[mask])[["coefficients"]],
                                     error = function(x){NA})})) %>%
    dv_cases_df = dv_cases_df %>%
      mutate(slope_6wk = purrr::map_dbl(coef_6wk, function(c) {
        tryCatch(c[[2]], error = function(x) {NA})})) %>%
      mutate(int_6wk = purrr::map_dbl(coef_6wk, function(c) {
        tryCatch(c[[1]], error = function(x) {NA})})) %>%
    lm_dv_cases_df = dv_cases_df %>%
      group_by(geo_value) %>%
      summarize(b = tryCatch(lm(cases ~ dv)[["coefficients"]][[1]],
                           error = function(x){NA}),
                m = tryCatch(lm(cases ~ dv)[["coefficients"]][[2]],
                           error = function(x){NA})) %>%
    df_cor_by_time_adj = covidcast_cor(
      dv_cases_df %>%
        mutate(value = dv*slope_6wk + int_6wk, issue = time_value) %>%
        as.covidcast_signal(signal = "dv_cases"),
      dv_cases_df %>%
        mutate(value = cases, issue = time_value) %>%
        as.covidcast_signal(signal = "dv_cases"),
      by = "time_value",
      method = "spearman")
    inner_join(df_cor_by_time %>% rename(orig_cor = value),
               df_cor_by_time_adj %>% rename(adj_cor = value),
               by = "time_value") %>%
      ggplot() +
      geom_line(aes(time_value, orig_cor, color = "Original")) +
      geom_line(aes(time_value, adj_cor, color = "Sensorized")) +
      scale_color_manual(values = ggplot_colors[1:2]) +
      labs(x = "Date", y = "Correlation",
           title = "Correlation-by-time of DV indicator and case rates",
           subtitle = "Over all counties with at least 500 cumulative cases") +
      theme_bw() +
      theme(legend.position = "bottom", legend.title = element_blank())

    From the plot above, we see that sensorizing the DV indicator greatly improves the correlations-by-time.

    It turns out that the regression coefficients in the method described above, which are fit using a sliding window, change significantly over time. As a future direction, we plan to look into this nonstationary relationship between the DV indicator and case rates to better understand why it changes and why counties in New York and New Jersey have such a different relationship in the first place.

    Exciting New Data

    We will end with exciting news: we have recently been granted access to medical insurance claims from Change Healthcare, a large healthcare technology company, and are in the process of creating multiple indicators from it. Their data covers approximately 45% of the population of the United States, so we expect it to enable us to substantially boost the accuracy, coverage, and resolution of our EMR-based indicators. We are also interested in comparing our new Change Healthcare indicator (a work in progress) to our current indicator in sample size, correlations, and behavior. Some insurance providers cover different populations with different rates, which could make their indicator more or less correlated with the reported case rates and other relevant indicators. We expect that by combining data sets and dramatically increasing coverage we can produce a significantly more useful indicator.

    Acknowledgements: Maria Jahja contributed immensely to every stage of this project, from determining which ICD codes to use to the final implementation of the indicator. Aaron Rumack devised the weekday adjustment and analyzed the performance of the DV indicator. Roni Rosenfeld worked closely with our health systems partners to get access to the data and provided domain knowledge to ensure that the data was useful. Both Roni and Ryan Tibshirani provided helpful suggestions and insights towards the methodology and analysis.

    Related Posts:

    Aaron Rumack is a Ph.D. student in the Machine Learning Department at CMU, advised by Roni Rosenfeld. He has been a member of the Delphi group since 2017.
    Roni Rosenfeld is a Principal Investigator in the Delphi group and a Professor and Head of the Machine Learning Department at CMU. He is also a Google Fellow.
    © 2024 Delphi group authors. Text and figures released under CC BY 4.0 ; code under the MIT license.

    Latest Stories