There’s nothing easy about building an analytics infrastructure in the healthcare industry. With data piling up in petabytes every couple of months and few organizations currently capable of wrestling their troves of clinical and financial data into an actionable format, the analytics landscape looks hopelessly complicated and prohibitively expensive.
But what if data scientists could help healthcare organizations understand the value and deep interdependence of their data stores in an intuitive manner based on standards and natural language? Jay Shah, Executive Vice President at Octo Consulting, believes that the newly emerging concept of the semantic web will provide a powerful boost to the effort to organize and understand healthcare data by creating new connections and leveraging the latest in cutting-edge data theory.
What is the semantic web and how is it different from our traditional understanding of data?
If you look at the traditional internet, it’s really the weaving together of hyperlinks to get a web of documents, right? When you are browsing the internet, you’re effectively browsing a bunch of documents that are linked together. The challenge with interacting with documents is that you don’t always have the context for what you’re looking at. Sometimes you have images. Sometimes you have data. Sometimes you have documents and, again, you’re not always going to know what you’re searching for. The challenges of the traditional web are around understanding these relationships and getting closer to a web of data, not just a web of documents.
The semantic web is about introducing some new concepts that can move the internet beyond just a network of documents and down to the data itself. That’s especially important within the context of health because it’s a much more powerful tool for you to understand the correlation between everything you’re looking at.
There are just so many drivers for why the data is coming. It’s just exploding in healthcare. Social media is just one example of that, but certainly, as EHRs are becoming more pervasive, people just have access to so many other resources to understand their own personal health. There are patterns and trends with disease, and tons of data published by the CDC and NIH, for example.
If you want to try to correlate all of that information, you have to know what you’re looking for. You can’t let the computer work for you. If you need to search for this disease in this city under these dates, then you’re putting yourself in the position where you’re the one who has to pull it all together.
The idea with the semantic web is that the relationships in the data are better understood, so that you’re effectively betting the machines will do a bit more of the thinking for you.
What are some of the challenges that prevent us from organizing data in this way?
One of the biggest challenges is that if you’re reading a document, you will understand the English language and the linguistic relationships things have with one another. If I say, “I have a son,” then there’s a linguistic implication that the person speaking is a mother or a father. Machines don’t understand those implied relationships. You’re forcing the relationships to be coded very specifically based on the data you have.
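To make the “I have a son” example concrete, here is a minimal sketch of how an implied relationship has to be spelled out for a machine. It assumes facts are stored as subject-predicate-object triples; the predicate names and the rule table are hypothetical, chosen only for illustration, and are not drawn from any particular ontology.

```python
# Facts are (subject, predicate, object) triples. All names are illustrative.
facts = {("speaker", "hasSon", "child")}

# Hypothetical rule table: each predicate implies a more general one.
IMPLIES = {"hasSon": "hasChild", "hasChild": "isParentOf"}

def infer(triples):
    """Apply the rules repeatedly until no new triple appears."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(known):
            q = IMPLIES.get(p)
            if q and (s, q, o) not in known:
                known.add((s, q, o))
                changed = True
    return known

inferred = infer(facts)
```

A human reader makes the jump from “has a son” to “is a parent” without thinking; here it only happens because someone wrote the rule down.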
So if I have a bunch of cancer research, and I want to correlate that cancer research with heart research to see if a specific cancer drug is known to raise blood pressure, I have to know that question. I need to know the question I want to ask, and then I basically build my data and relate my data in such a way that those relationships between that particular cancer drug and that particular research protocol and the adverse events that it had are clear to the computer. Well, what happens if you don’t know the exact question? You just want to see the data in order to help you think of questions that you might possibly want to ask.
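As a rough illustration of exploring linked data without a predefined question, the sketch below follows every link outward from a starting point. The drug, protocol, and study names are invented placeholders, not real research data, and the traversal is a simple stand-in for what a real linked-data query engine would do.

```python
# Invented placeholder triples linking a drug to research findings.
triples = [
    ("drugX", "studiedIn", "protocolA"),
    ("protocolA", "reportedAdverseEvent", "elevatedBloodPressure"),
    ("elevatedBloodPressure", "studiedBy", "heartStudyB"),
]

def neighbors(node):
    """Return (predicate, object) pairs for every triple starting at node."""
    return [(p, o) for s, p, o in triples if s == node]

def reachable(start):
    """Follow links outward from start and list everything connected to it."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        for _, o in neighbors(node):
            if o not in seen and o != start:
                seen.append(o)
                stack.append(o)
    return seen

connected = reachable("drugX")
```

Nothing in the traversal mentions blood pressure or heart research; those connections surface simply because the links exist in the data.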
One of the biggest challenges is still that defining those ontological relationships is a very difficult thing to do, because you have to get people to agree on what the relationships in the data are. I know that at the NIH, even within the National Cancer Institute and other research institutes, they can’t necessarily agree on the right relationships between certain types of data. So I think it’s still hard to get people to agree on that.
Data standardization is also a huge obstacle, even in clinician-reported outcomes. One nurse might be doing it one way and another nurse is doing it another way. Again, we know that from your experience. It becomes difficult to reconcile the data if I’m measuring temperature in Celsius but another person is doing it in Fahrenheit, for example.
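The temperature example can be sketched as a small normalization step. This assumes each reading arrives with an explicit unit label, which real clinical data often lacks; the function name and the accepted labels are hypothetical.

```python
def to_celsius(value, unit):
    """Convert a temperature reading to Celsius based on its unit label."""
    unit = unit.strip().upper()
    if unit in ("C", "CELSIUS"):
        return float(value)
    if unit in ("F", "FAHRENHEIT"):
        return (float(value) - 32.0) * 5.0 / 9.0
    # Refuse to guess when the unit is missing or unrecognized.
    raise ValueError(f"unknown temperature unit: {unit!r}")

# Two nurses recording the same temperature in different units.
readings = [(37.0, "C"), (98.6, "F")]
normalized = [round(to_celsius(v, u), 1) for v, u in readings]
```

Raising an error on an unknown unit is a deliberate choice in this sketch: silently guessing would hide exactly the kind of inconsistency that makes the data hard to reconcile.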
The other question is how scalable it is. I might be able to do it on a smaller scale, the tagging of the data and getting agreement on the standards and the relationships of the meta-data. But can I do that for the whole health industry? I think that’s probably one barrier that might not allow this to move super quickly. I do think, though, that the emergence of EHRs and meaningful use are really driving petabytes and petabytes of data to become available. These types of policy changes and technology shifts will speed up the conversation, but then I think there’s still going to be a hesitation to invest in this way of doing it. More people might say, “Well, let’s just build data warehouses and let’s do it the way that we’ve always done it,” because that’s just a little bit more well-known.
Is data warehousing an incorrect way of building an analytics infrastructure, or is it worthwhile to do what we can right now?
It’s not incorrect. In a data warehousing world you can probably anticipate the first 50, 100, 200 questions someone can ask about what impact A has on B, because you understand the data to some extent. If that’s all you’re trying to do, a data warehouse is a perfectly viable alternative, because trying to link all that data semantically might actually be cost prohibitive at first.
What we’re trying to show is that when, for the tenth or the hundredth time, you need to either add more data or change the questions you want to ask to fit what you have, the data warehouse alternative becomes very costly because of the maintenance involved in how you manage and mine it. When you’re trying to ask questions that no one could have anticipated, that’s where linked data becomes more efficacious, because it’s basically accepting the fact that you’re never going to know the question you want to answer.
The business case is there because the data is just going to continue to overwhelm us. But the semantic web is still definitely a conceptual topic. There are pockets of people trying to integrate the semantic web into healthcare. I think that’s happening as a grass-roots, open source concept that a lot of developers are gravitating towards, because it’s really unleashing the power of the internet.