Researchers have developed an improved influenza-like illness (ILI) reporting system by drawing on EHR technology and health data sharing. By aggregating information from EHRs and crowd-sourced data sources, the team built a program that may help improve the CDC’s ILI reports.
The study used user-generated data from athenahealth EHRs, Google Trends, Google Flu Trends (GFT), Twitter microblogs, and Flu Near You (FNY). Using three different algorithms, the researchers combined the five data sources into an ensemble prediction system, which also shed light on how well aggregating data sources works.
The EHR information the researchers used from athenahealth made considerable contributions to the predictive and real-time uses of the ensemble predictor. Because the athenahealth EHRs update flu symptom information weekly, and generally earlier than the CDC does, this data source helped the ensemble predictor increase its real-time and forecasting capabilities.
“By dynamically finding the best linear model to historically map athenahealth’s ILI onto CDC’s ILI, we were able to produce (out-of-sample) ILI estimates using athenahealth’s data as a predictor, one week ahead of CDC reports during our study period,” the researchers reported.
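One way to picture the approach the researchers describe is a straight-line fit that is refreshed every week using only past data. The sketch below is a minimal illustration under stated assumptions, not the study's code: the function names, the `window` length, and the input lists `ehr_ili` and `cdc_ili` are all hypothetical, and the article does not say how the researchers chose their "best" linear model.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        # Flat predictor: fall back to the mean of the target.
        return my, 0.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def rolling_estimates(ehr_ili, cdc_ili, window=8):
    """Estimate CDC ILI for each week t from that week's EHR value.

    The line is refit every week on the previous `window` weeks only,
    so each estimate is out-of-sample. `window` is an assumed parameter.
    """
    estimates = {}
    for t in range(window, len(ehr_ili)):
        a, b = fit_line(ehr_ili[t - window:t], cdc_ili[t - window:t])
        estimates[t] = a + b * ehr_ili[t]
    return estimates
```

Because the fit for week t only looks at weeks before t, `cdc_ili` can be one entry shorter than `ehr_ili`, which mirrors the situation in which the EHR figure for the latest week arrives before the CDC's.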
Not only did this study result in a usable tool for flu trend predictions, but it also provided considerable insights into the benefits of health data sharing and data integration.
“[O]ur results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons,” the researchers wrote.
The athenahealth EHR data, combined with the rest of the crowd-sourced data, helped create a system that has real-time and predictive abilities and is comparable to the CDC ILI predictor.
The CDC releases two ILI reports: an initial report that is essentially a set of raw, unrevised predictions, and a revised report that reflects the actual ILI data for a given time period. The results showed that the ensemble predictor matched the revised CDC report about as accurately as the unrevised CDC report did. This was an interesting finding, researchers said.
“It is interesting to highlight that the correlation and RMSE of the ensemble approach realtime predictions... are similar to the differences between revised and unrevised CDC reports,” the researchers wrote. “This means that our real-time ensemble model is as accurate a predictor of the revised CDC’s ILI estimates as the unrevised CDC data is. Thus, it is possible that we may be reaching the limit of what is possible, in terms of producing an accurate predictor of revised CDC’s ILI.”
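For readers unfamiliar with the two measures the researchers cite, the sketch below shows how correlation and root mean squared error (RMSE) are typically computed between a set of estimates and a reference series. It is a generic illustration using the standard definitions; the function names are hypothetical and no figures from the study are reproduced here.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(pred, truth):
    """Root mean squared error of `pred` against `truth`."""
    return sqrt(mean([(p - t) ** 2 for p, t in zip(pred, truth)]))
```

Scoring both the ensemble's real-time estimates and the unrevised CDC figures against the revised CDC series with these two functions would allow the kind of side-by-side comparison the researchers describe.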
Additionally, the project compared the different data sources against one another for effectiveness. The results show that, overall, integrating and sharing health data is more effective than relying on any individual data source.
“Our results show that our real-time ensemble predictions outperform every real-time flu predictor constructed independently with each data source,” the researchers reported. “This fact suggests that combining information from multiple independent flu predictors is advantageous over simply choosing the best performing predictor. This is the case not only for real-time predictions but also for the one, two and three week forecasts presented.”
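The article does not name the three algorithms the researchers used, so the sketch below substitutes a deliberately simple combination rule, inverse-error weighting, purely to illustrate what combining several flu predictors can look like. The function names and the dictionary layout are assumptions made for this example.

```python
def inverse_mse_weights(history, truth):
    """Weight each predictor by the inverse of its past mean squared error.

    history: dict mapping a source name to its list of past estimates.
    truth:   list of the reference values for the same weeks.
    """
    eps = 1e-12  # avoids division by zero for a perfect predictor
    inv = {}
    for name, preds in history.items():
        mse = sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)
        inv[name] = 1.0 / (mse + eps)
    total = sum(inv.values())
    return {name: value / total for name, value in inv.items()}


def combine(weights, current):
    """Weighted average of this week's estimates from each source."""
    return sum(weights[name] * current[name] for name in weights)
```

Under this rule, sources that tracked the reference series more closely in the past receive more weight in the current week's combined estimate, while weaker sources still contribute a small share rather than being discarded.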
These results may be relevant to more than just flu season. Although the study looked specifically at data sources for flu symptoms, the researchers suggest the same methods could be applied to other illnesses, potentially changing the way infectious diseases are reported.
“Indeed, infectious diseases such as Dengue or Malaria, for which multiple surveillance methods are in place would benefit from combining information in a similar way to the one proposed here,” the researchers explained.
Approaches such as the ensemble predictor may also be effective in areas where surveillance data are scarcer, because they draw on multiple crowd-sourced data sources.
“Moreover, disease surveillance data at finer spatial resolutions tend to be scarcer and often unreliable, and thus, approaches like ours may help produce more accurate and robust disease incidence estimates, at higher spatial resolutions, by drawing data from multiple sources,” the researchers wrote.