Clinical trials take a long time to be published, if they are published at all. And when they are, most are either missing critical information or have changed the way the outcomes are described to suit the results they found (or wanted). Despite these problems, the vast majority of the new methods and technologies that we build in biomedical informatics to improve evidence synthesis remain focused on published articles and bibliographic databases.
Simply: no matter how accurately we can convince a machine to screen the abstracts of published articles for us, we are still bound by what doesn’t appear in those articles.
We might assume that published articles are the best source of clinical evidence we have. But there is an obvious alternative. Clinical trial registries are mature systems that have been around for more than a decade, their use is mandated by law and policy in many countries and organisations, and they tell us more completely and more quickly what research should be available. With new requirements for results reporting coming into effect, more and more trials have structured summary results available (last time I checked ClinicalTrials.gov, it was 7.8 million participants from more than 20,000 trials, which makes up more than 20% of all completed trials).
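To give a sense of how accessible those structured results already are, here is a minimal sketch of querying ClinicalTrials.gov for registrations with posted results via its public API. The endpoint, parameter names, and JSON paths are my assumptions about the current version of the API rather than anything guaranteed by the registry, so treat it as a starting point rather than a recipe.

```python
import requests

API_URL = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint

def registrations_with_results(condition, page_size=50):
    """Yield (NCT ID, brief title) for registrations that have posted results."""
    params = {"query.cond": condition, "pageSize": page_size}
    while True:
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        for study in payload.get("studies", []):
            if study.get("hasResults"):  # structured summary results are posted
                ident = study["protocolSection"]["identificationModule"]
                yield ident["nctId"], ident.get("briefTitle", "")
        token = payload.get("nextPageToken")
        if not token:
            break
        params["pageToken"] = token

if __name__ == "__main__":
    for nct_id, title in registrations_with_results("hypertension"):
        print(nct_id, title)
```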
The reality is that not all trials are registered, not all registered trials are linked to their results, and not all results are posted in clinical trial registries. And having some results available in a database doesn’t immediately solve the problems of publication bias, outcome reporting bias, spin, and failure to use results to help design replication studies.
In a systematic review of the processes used to link clinical trial registries to published articles, we found that the proportion of trials that had registrations was about the same as the proportion of registrations that had publications (when checked properly, not the incorrect number of 8.7 million patients you might have heard about). Depending on whether you are an optimist or a pessimist, you can say that what is available in clinical trial registries is just as good (or just as bad) as what is available in bibliographic databases.
Beyond that, the semi-structured results available in ClinicalTrials.gov are growing rapidly (in both volume and proportion). The results data (a) help us to avoid some of the biases that appear in published research; (b) can appear earlier than the corresponding publications; (c) can be used to check and reconcile published results; and (d) as it turns out, make it much easier to convince a machine to screen for trial registrations that meet a given set of inclusion criteria.
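To illustrate point (d): once eligibility-relevant information lives in named fields rather than free text, screening becomes a set of simple checks instead of a language-understanding problem. The toy record layout, criteria, and NCT IDs below are my own simplified assumptions, not the registry's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Registration:
    """A deliberately simplified registry record (not the real schema)."""
    nct_id: str
    condition: str
    phase: str
    enrollment: int
    randomised: bool

def meets_criteria(reg):
    """Example criteria: randomised phase 3 hypertension trials, 100+ participants."""
    return (
        reg.randomised
        and reg.phase == "PHASE3"
        and "hypertension" in reg.condition.lower()
        and reg.enrollment >= 100
    )

# Made-up example records for illustration only.
candidates = [
    Registration("NCT00000001", "Hypertension", "PHASE3", 450, True),
    Registration("NCT00000002", "Type 2 Diabetes", "PHASE2", 80, True),
]
print([r.nct_id for r in candidates if meets_criteria(r)])  # ['NCT00000001']
```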
I suspect that the assumption that clinical trial registries are less useful than the published literature is a big part of the reason why nearly all of the machine learning and other natural language processing research in the area is still stuck on published articles. But that is a bad assumption.
Back in 2012, we wrote in Science Translational Medicine about how an open community of researchers could build and grow a repository of structured clinical trial results data. The following is based on that list but with nearly 6 years of hindsight:
- Make it accessible: And not just in the sense of being open, but by providing tools that make it easier to access results data; tools that support data extraction, searching, screening, and synthesis. We already have lots of tools in this space that were developed to work with bibliographic databases, and many could easily be modified to work with results data from clinical trial registries (a sketch of one such extraction step appears after this list).
- Make all tools available to everyone: The reason why open source software communities work so well is that by sharing, people build tools on top of other tools on top of methods. It was an idea of sharing born of necessity back when computers were scarce and slow, and people had to make the most of their time with them. Tools for searching, screening, cleaning, extracting, and synthesising should be made available to everyone via simple user interfaces.
- Let people clean and manage data together: Have a self-correcting mechanism that allows people to update and fix problems with the metadata representing trials and links to articles and systematic reviews, even if the trial is not their own. And share those links, because there’s nothing worse than duplicating data cleaning efforts. If the Wikipedia/Reddit models don’t work, there are plenty of others.
- Help people allocate limited resources: If we really want to reduce the amount of time it takes to identify unsafe or ineffective interventions, we need to support the methods and tools that help the whole community identify the questions that are most in need of answers, and decide together how best to answer them, rather than competing to chase bad incentives like money and career progression. Methods for spotting questions with answers that may be out of date should become software tools that anyone can use.
- Make it ethical and transparent: There are situations where data should not be shared, especially when we start thinking about including individual participant data in addition to summary data. There are also risks that people may use the available data to tell stories that are simply not true. Risks related to ethics, privacy, and biases need to be addressed and tools need to be validated carefully to help people avoid mistakes whenever possible.
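On the first point about data extraction, here is a rough sketch of the kind of small, shareable tool I mean: pulling the reported outcome measures out of a ClinicalTrials.gov study record that has posted results. The field names are assumptions based on the registry's current JSON layout and would need checking against real records.

```python
import json

def extract_outcome_measures(study):
    """Yield (type, title, time frame) for each reported outcome measure."""
    module = study.get("resultsSection", {}).get("outcomeMeasuresModule", {})
    for measure in module.get("outcomeMeasures", []):
        yield (
            measure.get("type", ""),      # e.g. PRIMARY or SECONDARY
            measure.get("title", ""),
            measure.get("timeFrame", ""),
        )

# Usage: a single study record saved from the API (see the earlier sketch),
# i.e. the JSON for one entry in the "studies" list.
with open("study.json") as fh:
    study_record = json.load(fh)

for outcome_type, title, time_frame in extract_outcome_measures(study_record):
    print(outcome_type, title, time_frame)
```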
We are already starting to do some of this work in my team. But there are so many opportunities for people in biomedical informatics to think beyond the bibliographic databases, and to develop new methods that will transform the way we do evidence synthesis. My suggestion: start with the dot points above. Take a risk. Build the things that will diminish the bad incentives and support the good ones.