On the value of deplatforming, and seeing online misinformation as an opportunity to counter misinformed beliefs in front of a key audience

Adam Dunn, University of Sydney

The government is rolling out a new public information campaign this week to reassure the public about the safety of COVID-19 vaccines, which one expert has said “couldn’t be more crucial” to people actually getting the jabs when they are available.

Access to vaccines is the most important barrier to achieving good coverage, so a campaign that explains the who, how, and when should go a long way toward getting the right people vaccinated at the right time.

But it also comes as government ministers — and even the prime minister — have refused to address the COVID-19 misinformation coming from those within their own ranks.

Despite advice from the Therapeutic Goods Administration explaining that hydroxychloroquine is not an effective treatment for COVID-19, MP Craig Kelly has continued to promote the opposite on Facebook. A letter he wrote on the same topic, bearing the Commonwealth coat of arms, was also widely distributed.

He has also incorrectly advocated the use of the anti-parasitic drug ivermectin as a treatment for COVID-19, and encouraged people to protest against what he called “health bureaucrats in an ivory tower”.

Compared to health experts, politicians and celebrities tend to have access to larger and more diverse audiences, particularly on social media. But politicians and celebrities may not always have the appraisal skills they need to assess clinical evidence.

I spend much of my time examining how researchers introduce biases into the design and reporting of trials and systematic reviews. Kelly probably has less experience in critically appraising trial design and reporting. But if he and I were competing for attention among Australians, his opinions would certainly reach a much larger and varied segment of the population.

Does misinformation really cause harm?

According to a recent Quantum Market Research survey of 1,000 people commissioned by the Department of Health, four in five respondents said they were likely to get a COVID-19 vaccine when it’s made available.

Australia generally has high levels of vaccine confidence compared to other wealthy countries – 72% strongly agree that vaccines are safe and less than 2% strongly disagree.

But there does appear to be some hesitancy about the COVID-19 vaccine. In the Quantum survey, 27% of respondents overall, and 42% of women in their 30s, had concerns about vaccine safety. According to the report, this showed “a need to dispel some specific fears held by certain cohorts of the community in relation to potential adverse side effects”.

For other types of COVID misinformation, a University of Sydney study found that younger men had stronger agreement with misconceptions and myths, such as the efficacy of hydroxychloroquine as a treatment, the claim that 5G networks spread the virus, or the idea that the virus was engineered in a lab.

Surveys showing how attitudes and beliefs vary by demographics are useful, but it is difficult to know how exposure to misinformation affects the decisions people make about their health in the real world.

Studies measuring what happens to people’s behaviours after misinformation reaches a mainstream audience are rare. One study from 2015 looked at the effect of an ABC Catalyst episode that misrepresented evidence about cholesterol-lowering drugs — it found fewer people filled their statin prescriptions after the show.

When it comes to COVID-19, researchers are only starting to understand the influence of misinformation on people’s behaviours.

After public discussion about using bleach to potentially treat COVID-19, for instance, the number of internet searches about injecting and drinking disinfectants increased. This was followed by a spike in the number of calls to poison control phone lines for disinfectant-related injuries.

Does countering misinformation online work?

The aim of countering misinformation is not to change the opinions of the people posting it, but to reduce misperceptions among the often silent audience. Public health organisations promoting the benefits of vaccinations on social media consider this when they decide to engage with anti-vaccine posts.

A study published this month by two American researchers, Emily Vraga and Leticia Bode, tested the effect of posting an infographic correction in response to misinformation about the science of a false COVID-19 prevention method. They found a bot developed with the World Health Organization and Facebook was able to reduce misperceptions by posting factual responses to misinformation when it appeared.

A common concern about correcting misinformation in this way is that it might cause a backfire effect, leading people to become more entrenched in misinformed beliefs. But research shows the backfire effect appears to be much rarer than first thought.

Vraga and Bode found no evidence of a backfire effect in their study. Their results suggest that responding to COVID-19 misinformation with factual information is likely to do more good than harm.

So, what’s the best strategy?

Social media platforms can address COVID-19 misinformation by simply removing or labelling posts and deplatforming users who post it.

This is probably most effective in situations where the user posting the misinformation has a small audience. In these cases, responding to misinformation with facts in a more direct way may be a waste of time and could unintentionally amplify the post.

When misinformation is shared by people like Kelly who are in positions of power and influence, removing those posts is like cutting a head off a hydra. It doesn’t stop the spread of misinformation at the source and more of the same will likely fill the void left behind.

In these instances, governments and organisations should consider directly countering misinformation where it occurs. To do this effectively, they need to consider the size of the audience, respond to the misinformation and not the person, and present evidence in simple and engaging ways.

The government’s current campaign fills an important gap in providing simple and clear information about who should get vaccinated and how. It doesn’t directly address the misinformation problem, but I think this would be the wrong place for that kind of effort, anyway.

Instead, research suggests it might be better to directly challenge misinformation where it appears. Rather than demanding the deplatforming of the people who post misinformation, we might instead think of it as an opportunity to correct misperceptions in front of the audiences that really need it.

Adam Dunn, Associate professor, University of Sydney

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Do Twitter bots spread vaccine misinformation?

Discussion of online misinformation in politics and public health often focuses on the role of bots, organised disinformation campaigns and “fake news”. A closer look at what typical users see and engage with about vaccines reveals that for most Twitter users, bots and anti-vaccine content make up a tiny proportion of their information diet.

Having studied how vaccine information spreads on social media for several years, I think we should refocus our efforts on helping the consumers of misinformation rather than blaming the producers. The key to dealing with misinformation is to understand what makes it important in the communities where it is concentrated.

Vaccine-critical Twitter

In our latest study, published in the American Journal of Public Health, we looked at how people see and engage with vaccine information on Twitter. We showed that while people often see vaccine content, not much of it is critical and almost none comes from bots.

While some other research has counted how much anti-vaccine content is posted on social media, we went a step further and estimated the composition of what people saw and measured what they engaged with. To do this we monitored a set of 53,000 typical Twitter users from the United States. Connecting lists of whom they follow with more than 20 million vaccine-related tweets posted from 2017 to 2019, we were able to track what they were likely to see and what they passed on.
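For readers who want a concrete picture of the kind of join involved, here is a minimal sketch in Python/pandas. The tiny tables, column names, and the simplifying assumption that a potential exposure just means "follows the author of a vaccine-related tweet" are illustrative only, not the pipeline used in the study.

```python
import pandas as pd

# who each sampled user follows: one row per (user, followed account)
follows = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u3"],
    "followed_id": ["a1", "a2", "a1", "a3"],
})

# vaccine-related tweets: one row per tweet, with its author and a label
tweets = pd.DataFrame({
    "tweet_id":    ["t1", "t2", "t3"],
    "author_id":   ["a1", "a2", "a3"],
    "is_critical": [False, True, False],
})

# a tweet counts as a potential exposure for every sampled user
# who follows its author
exposures = follows.merge(tweets, left_on="followed_id", right_on="author_id")

# per-user summary: how much vaccine content each user may have seen,
# and how much of it was critical of vaccines
summary = exposures.groupby("user_id").agg(
    seen=("tweet_id", "count"),
    seen_critical=("is_critical", "sum"),
)
print(summary)
```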

In those three years, a typical Twitter user in the US may have seen 727 vaccine-related tweets. Just 26 of those tweets would have been critical of vaccines, and none would have come from a bot.

While it was relatively infrequent, nearly 37% of users posted or retweeted vaccine content at least once in the three years. Only 4.5% of users ever retweeted vaccine-critical content and 2.1% of users retweeted vaccine content posted by a bot.

For 5.8% of users in the study, vaccine-critical tweets made up most of the vaccine-related content they might have seen on Twitter in those three years. This group was more likely to engage with vaccine content in general and more likely to retweet vaccine-critical content.

Studying people, not posts

Many social media analyses about misinformation are based on counting the number of posts that match a set of keywords or hashtags, or how many users have joined public groups. Analyses like these are relatively easy to do.

However, these numbers alone don’t tell you anything about the impact of the posts or groups. A tweet from an account with no followers or a blog post on a website that no one visits is not the same as a major news article, a conversation with a trusted community member, or advice from a doctor.

Information consumption is hard to observe at scale. My team and I have been doing this for many years, and we have developed some useful tools in the process.

In 2015 we found that a Twitter user’s first tweet about HPV vaccines is more likely to be critical if they follow people who post critical content. In 2017, we found lower rates of HPV vaccine uptake across the US were associated with more exposure to certain negative topics on Twitter.

A study published in Science in 2019 used a similar approach and found fake news about the 2016 US election made up 6% of relevant news consumption. That study, like ours, found engagement with fake news was concentrated in a tiny proportion of the population.

I also think analyses focused on posts are popular because it is convenient to be able to blame “others”, including organised disinformation campaigns from foreign governments or reality TV hosts, even when the results don’t support the conclusion. But people prone to passing along misinformation don’t live under bridges eating goats and hobbits. They are just people.

Resisting health misinformation online

When researchers move beyond counting posts to learn why people participate in communities, we can find new ways to empower people with tools to help them resist misinformation. Social media platforms can also find new ways to add friction to sharing any posts that have been flagged as potentially harmful.

While there are unresolved challenges, the individual and social psychology of debunking misinformation is a mature field. Evidence-based guides on debunking conspiracy theories in online communities are available. Focusing on the places where people encounter misinformation will help to better connect data science and behavioural research.

Connecting these fields will help us understand what makes misinformation salient instead of just common in certain communities, and to decide when debunking it is worthwhile. This is important because we need to prioritise cases where there is potential for harm. It is also important because calling out misinformation can unintentionally help it gain traction when it might otherwise fade away.

Vaccination rates remain a problem in places with higher rates of vaccine hesitancy and refusal, and those places are at higher risk of outbreaks. So let’s focus on ways to give people in vulnerable populations the tools they need to protect themselves against harmful information.

Adam Dunn, Associate professor, University of Sydney

This article is republished from The Conversation under a Creative Commons license. Read the original article.

trial2rev: seeing the forest for the trees in the systematic review ecosystem

tl;dr: we have a new web-based system called trial2rev that links published systematic reviews to included and relevant trials based on their registrations in ClinicalTrials.gov. Our report on this system has been published today in JAMIA Open. The first aim is to make it easy for systematic reviewers to monitor the status of ongoing and completed trials that are likely to be included in the updates of systematic reviews, but we hope the system will be able to do much more than just that.

Skip to the subsection trial2rev below for the details or continue reading here for the background.


Systematic reviews are facing a weird kind of crisis at the moment. For years the problem was that systematic reviews were time consuming and resource intensive, and clinical trials and studies were being published at such a rate that it was impossible to do enough systematic reviewing to keep up. Back in 2010, Hilda Bastian and colleagues wrote “Seventy-Five Trials and Eleven Systematic Reviews a Day: How Will We Ever Keep Up?”, and in the same year Fraser and Dunstan wrote “On the impossibility of being expert”. Seeing it as an automation problem, a range of different folk—including smart computer scientists and information retrieval experts—decided to try to speed up the individual processes that make up systematic reviews.

What is the problem?

Some of the most recent methods for screening articles for inclusion in a systematic review are clever and will likely make a difference in how individual systematic reviews are performed. But we struggle to take these methods beyond a support role (say, as a second independent screener) because we don’t have enough training data. Most novel methods have used data from 24 or fewer systematic reviews. In a 2017 example, Shekelle and others proposed a neat new method for screening articles that might be included in a systematic review. The training and testing data that underpinned that method? Included and excluded studies from just 3 systematic reviews.
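To make the training-data problem concrete, here is a generic abstract-screening baseline in Python: TF-IDF features plus logistic regression. It is not the method from the Shekelle paper or any other specific system, and the toy abstracts and labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labelled abstracts: 1 = included in the review, 0 = excluded
abstracts = [
    "randomised controlled trial of drug X for hypertension",
    "cohort study of dietary salt and blood pressure",
    "case report of a rare adverse event",
    "randomised trial of drug X versus placebo in adults",
    "editorial on hypertension guidelines",
    "double-blind randomised trial of drug X dosing",
    "narrative review of blood pressure measurement",
    "pilot randomised study of drug X in older adults",
]
included = [1, 0, 0, 1, 0, 1, 0, 1]

screener = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# With labels drawn from only a handful of reviews, estimates like this
# are wide and fragile, which is the core problem described above.
scores = cross_val_score(screener, abstracts, included, cv=2, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())
```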

Back in 2012, I was writing in Science Translational Medicine about what the future of evidence synthesis might look like—doing away with cumbersome systematic reviews and creating a utopia of shared access to patient-level clinical trial data, with an ecosystem of open source software tools to support the rapid incorporation of new trial data into systematic reviews at the press of a button. That obviously hasn’t happened yet. What we didn’t really think about at the time was the quality and integrity of systematic reviews: that there would be so many bad systematic reviews being done that they might drown out the good ones while still leaving other clinical questions un-synthesised. Fast-forward to 2016 and we have at least one person suggesting that the vast majority of systematic reviews being published are unnecessary, redundant, or misleading.

So there are two major problems here. First, the more resources we throw at systematic reviews and the better we get at automating the individual processes that go into them, the more likely it is that systematic reviewers will continue to flood the space with reviews that are unnecessary, redundant, or misleading. Second, with so few training and testing examples being used and no one openly sharing data linking systematic reviews to included/excluded trial articles en masse, I don’t think we will ever get to the place we need to be to truly transform the way we do systematic reviews and stem the tide of garbage being published.

Can we fix it?

There have been some efforts aimed at helping systematic reviewers decide if and when they should actually do a systematic review. There was even a useful tool developed in 2013 that uses empirical data from lots of old systematic reviews and their updates to try to guess whether a systematic review is likely to need updating.

Back in 2005, Ida Sim absolutely had the right idea when she proposed a global trial bank for reporting structured results, and even did some work on automatically extracting key information from trial reports.

Structured and machine-readable representations of trials and tools for knowing when to update an individual systematic review are a good start, but here’s what else I think needs to be done:

  • The design and development of new and more general empirical methods for predicting which systematic reviews actually need to be updated based on likelihood of conclusion change.
  • New journal policies that integrate the use and reporting of these tools, which will hopefully reduce the number of bad systematic reviews being published, or at least be used to assess and downgrade journals that publish too many unnecessary and redundant systematic reviews.
  • Broad release of clinical trial results data where the outcomes are properly mapped to standardised sets of clinical outcomes, and eventually to include patient-level data made accessible in privacy-protecting ways.
  • Then finally and most importantly, full structured representations not just of trials and their results, but of systematic reviews that include detailed and standardised representations of inclusion and exclusion criteria, and lists of links to the included and excluded studies (a rough sketch of what such a record might look like follows this list).
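As a rough illustration of that final point, here is a minimal sketch of a machine-readable systematic review record linked to trial registrations. The field names and identifiers are placeholders, not a published schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrialLink:
    registry_id: str           # e.g. an NCT number from ClinicalTrials.gov
    included: bool             # included in the synthesis, or screened out
    exclusion_reason: str = ""

@dataclass
class SystematicReviewRecord:
    review_doi: str
    question: str                      # PICO-style question as free text
    inclusion_criteria: List[str] = field(default_factory=list)
    exclusion_criteria: List[str] = field(default_factory=list)
    trials: List[TrialLink] = field(default_factory=list)

# Placeholder DOI and registration number, purely for illustration
record = SystematicReviewRecord(
    review_doi="10.xxxx/example",
    question="Drug X versus placebo for condition Y in adults",
    inclusion_criteria=["randomised controlled trials", "adults"],
    trials=[TrialLink("NCT00000000", included=True)],
)
print(record)
```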

We should end up with completely transparent and machine-readable data covering everything from the details of the systematic reviews linked to the trials they included (or excluded) down to the individual participants used to synthesise the results. With this, we can skip over the band-aid solution of automating individual systematic review processes and move to a new phase in evidence synthesis where trial results are immediately incorporated into a smaller number of trustworthy, complete, living, and useful systematic reviews.

trial2rev

So to give this a bit of a nudge in the right direction, we created a shared space for humans and AI to work together to fill in some of the gaps and take advantage of a set of simple tools that emerge from these data … and then we gave it all away for free—even including the code for the platform itself.

We designed the trial2rev system to do two things. First, we wanted it to be simple and fast enough that systematic reviewers can use it to track the status of ongoing and completed trials that might be relevant to the published systematic reviews they follow, without wasting their time or requiring special expertise.

Second, we designed it to provide access to lots of machine-readable information about trials included in systematic reviews for use in training and testing of new methods for making systematic reviews better. And although the information is currently imperfect and noisy for most systematic reviews, we wanted to make it easy to quickly access a decent-sized sample of verified examples that are known to be complete and correct.


Figure. The user interface for engaging with a systematic review. The trial2rev system currently has partial information for more than 12,000 systematic reviews.

We also wanted the system to be efficient and avoid the duplication of effort that currently happens as systematic reviewers go about their business without properly sharing the data they produce. So in our system, when registered users go in to fix up and improve information about their own systematic reviews, everyone else immediately gets to access that information. Not only that, but the more that people use the system, the smarter the machine learning methods get at identifying relevant trials and the less work humans will need to do.

The trial2rev system also helps track similar systematic reviews by looking at the overlap in their included trials. We know that, for a variety of reasons, systematic reviews that include the same trials can end up producing different conclusions. We also know that systematic reviews are sometimes repeated for the same topic with the same sets of studies simply for the sake of publishing a systematic review, rather than to fill a necessary gap in the literature. Our hope is that we can do a better job of monitoring how this has happened in different areas, and potentially develop the system as a tool for journal editors to assess the novelty of the systematic review manuscripts they receive.
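One simple way such overlap could be quantified is the Jaccard similarity between the sets of trial registrations two reviews include. The sketch below uses placeholder NCT numbers and an arbitrary threshold; it illustrates the idea rather than the scoring used in trial2rev.

```python
# Jaccard overlap between the sets of trials included by two reviews
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

review_a = {"NCT00000001", "NCT00000002", "NCT00000003"}
review_b = {"NCT00000002", "NCT00000003", "NCT00000004"}

overlap = jaccard(review_a, review_b)
if overlap >= 0.5:   # an arbitrary example threshold
    print(f"High overlap ({overlap:.2f}): possible redundant review")
```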

The system is a prototype. It is still missing a number of things that would make it more useful, like the ability to add published reports for unregistered trials that were included in systematic reviews; the ability to add systematic review protocols and get back a list of trial registrations that are likely to be included; and bots that use active learning approaches to reduce workload in screening of trial registrations for inclusion in these new systematic reviews.
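To illustrate that last idea, here is a minimal uncertainty-sampling loop of the kind an active-learning screening bot could use: train on whatever registrations have already been screened, then ask a human to label the registration the model is least sure about. The text snippets and labels are toy placeholders, not trial2rev’s implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Registrations already screened by humans (1 = include, 0 = exclude)
labelled_text = ["randomised trial of drug X", "observational registry study",
                 "trial of drug X in children", "survey of clinicians"]
labels = [1, 0, 1, 0]

# Registrations still waiting to be screened
unlabelled_text = ["phase 3 trial of drug X", "qualitative interview study",
                   "pragmatic trial of usual care"]

vec = TfidfVectorizer()
X_lab = vec.fit_transform(labelled_text)
X_unlab = vec.transform(unlabelled_text)

clf = LogisticRegression().fit(X_lab, labels)
probs = clf.predict_proba(X_unlab)[:, 1]

# query the registration closest to the decision boundary (probability ~0.5)
query_idx = int(np.argmin(np.abs(probs - 0.5)))
print("Ask a human to screen:", unlabelled_text[query_idx])
```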

Some of these will definitely be added in the very near future. Others we have decided not to add specifically because our aim was to reduce our collective reliance on published trial reports and focus more on the use of structured trial results data.

In the meantime, you can check out the article:

P Newman, D Surian, R Bashir, FT Bourgeois, AG Dunn (2018). Trial2rev: Combining machine learning and crowd-sourcing to create a shared space for updating systematic reviews. JAMIA Open, ooy062. [doi:10.1093/jamiaopen/ooy062]

How articles from financially conflicted authors are amplified, why it matters, and how to fix it.

tl;dr: we have a new article on conflicts of interest, published today in JAMA.

Imagine you are attempting to answer a question about your health or the health of someone in your care. The answer isn’t immediately obvious so you search online and find two relevant articles from reputable journals, print them out, and put them both down on a table to read. For the moment, don’t worry whether the articles are clinical studies or systematic reviews [and also don’t worry that nobody actually prints things out any more, you’re ruining the imagery]. You flip through both of them, find that they are very similar in design, but the conclusions are different. How do you choose which one to base your decision on?

You then discover that one of the articles was written by authors who have received funding and consulting fees from the company that stands to profit from the decision you will end up making; the other article was not. Would that influence your decision?


Financial conflicts of interest matter. On average, articles written by authors with financial conflicts of interest are different from other articles. Studies on the topic have concluded that financial conflicts of interest have contributed to delays in the withdrawals of unsafe drugs, are an excellent predictor of the conclusions of opinion pieces, and are even associated with conclusions in systematic reviews.

The consequences: around a third of the drugs that are approved will eventually have a safety warning added to them or be withdrawn, and it typically takes more than four years after approval for those issues to be discovered. Globally, we spend billions of extra dollars on drugs, devices, tests, and operations that are neither safer nor more effective than the less expensive care they replaced.

Articles written by authors with financial conflicts of interest are sometimes just advertising masquerading as research. The problem is that if we look at an article in isolation, we usually can’t tell whether it belongs in the sometimes category or not. I will come back to this problem at the end but first…


We might not have found all of the financial conflicts of interest. In the research that was published today, we estimated the prevalence of financial conflicts of interest across all recently published biomedical research articles. To do that we sampled articles from the literature (as opposed to sampling journals). To be included, the articles had to have been published in 2017, in a journal that was both indexed by PubMed and listed with the International Committee of Medical Journal Editors (which means they are supposed to adhere to standards for disclosing conflicts of interest). We also excluded special types of articles including news, images, poetry, and other things that do not directly report or review research findings.

We will definitely have missed some conflicts of interest that existed but were not disclosed, for one reason or another [This is common – there were even missing disclosures in at least one of the articles published in JAMA this week]. It may be that authors failed to disclose all of their relevant conflicts of interest, or that journals failed to include authors’ disclosures in the full text of the articles. Some of these will be in the 13.6% of articles that did not include any statement about conflicts of interest, but my guess is that most will be among the 63.6% of articles that carried a standard “none” disclosure, where conflicts went unreported because of laziness, a lack of communication among co-authors and journals, or because authors or journals do not understand what a conflict of interest is.

That there is likely to be missing information from what is available in the disclosures is precisely the reason why we need an open, machine-readable system for listing financial interests for every published author of research, and not just the subset of practicing physicians from a single country. More on this later but first…


Articles from authors with conflicts of interest get more attention. We found that articles written by authors with financial conflicts of interest tend to be published in journals with higher impact factors, and have higher Altmetric scores compared to articles where none of the authors have financial conflicts of interest.

You are clever, so the first thing you are likely to say when you read that statement is: “Isn’t that because journals with higher impact factors publish more big randomised controlled trials, so the mix of study types is different?” A good question, but we also checked within the different categories of articles and the difference still holds. For example, a drug trial written by someone with a financial conflict of interest is much more likely than a drug trial without one to be in a journal with a higher impact factor, to have been written about in mainstream news media, and to have been tweeted a lot more often.

There are a couple of potential consequences.

If you are keeping track of conflicts of interest in the research you read about in the news and on your social feeds, it might feel like most biomedical research is funded by industry or written by researchers who take money from the industry. If you believe that financial conflicts of interest are associated with bias and problems with integrity, then seeing conflicts of interest in most of what you read might make you feel distrustful of research generally. Add that to the sensationalism and over-generalisations that are common in health reporting in the news and on social media, and it is unsurprising that sections of the public take a relatively dim view of biomedical research.

In reality though, most biomedical articles are written by people like me, who have no financial relationship with any particular industry that might end up making more profit by making a drug, device, diagnostic tool, diet, or app look safer and more effective than it really is. So we hope this new research can serve as a reminder that most research is undertaken by researchers who have no financial conflict of interest to influence the ways they design and report their experiments.

Being published in a higher impact journal also tends to mean more attention from the research community, regardless of the quality and integrity of the research. While I can’t tell you about it yet, we have more evidence showing that trials in higher impact journals make it into systematic reviews more quickly than trials in lesser-known journals. More attention from within the research community might create a kind of amplification bias, where the results and conclusions of articles written by researchers with financial conflicts of interest have greater influence in reviews, systematic reviews, editorials, guidelines, policies, and other technologies used to help clinicians make decisions with their patients.


It might seem impossible but there are steps we can take to improve the system. I have already written quite a bit about why we need to make financial conflicts of interest more transparent and accessible, including in a review we published in Research Integrity and Peer Review as well as a thing I wrote for Nature. It is worth rephrasing this argument again here to explain why disclosure alone is not enough to fix the problem.

Back to our first problem again. What are we actually supposed to do with an article if we find that an author has a financial conflict of interest? Completely discounting the research would be a mistake and a serious waste of time and effort (not to mention a waste of the risks taken by the people who participated in the trial). But trusting it completely would also be a problem, because we know there is a real risk that the research in the article was designed specifically to produce a favourable conclusion; that it might be the visible tip of an iceberg under which sits a large mass of unpublished and less favourable results; that the outcomes reported were only the ones that made the intervention look good; or that the conclusions obfuscate the safety issues and exaggerate the efficacy.

I’ve spent many years looking for exactly these issues in clinical studies and systematic reviews and I still have trouble identifying them quickly. And the evidence shows that even the best systematic reviewers with the best tools for measuring risks of bias still can’t explain why industry funded studies are more often favourable than their counterparts.

The current reality is that we use financial conflicts of interest as a signal that we need to be more careful when appraising the integrity of what we are reading. It’s not a great signal but the alternative is to spend hours trying to make sense of the research in the context of all the other research answering the same questions.

This gives us a hint at where to go next. If financial conflicts of interest are a poor signal of the integrity of a published article, then we need a better signal. To do this, we need to make sure that conflict of interest disclosures have the following characteristics:

  • Complete: missing and incorrect disclosure statements mean that any studies we do looking at how they are different are polluted by noise. To make disclosures complete, all authors of research (not just physicians in the US) could declare them in a public registry. That would have the added bonus of saving lots of time when publishing an article.
  • Categorised: a taxonomy for describing different types of conflicts of interest would also improve the quality of the signal. A researcher who relies entirely on funding from a pharmaceutical company is different from a researcher who gets paid to travel to conferences to talk about how great a new drug is, and both are very different from a researcher who once went to a dinner that was partially sponsored by a company.
  • Publicly accessible: There are companies trying to build the infrastructure to capture conflict of interest disclosures per author but they are not public. I think we should couple ORCID and CrossRef to store and update records of conflicts of interest for all researchers.
  • Machine-readable: extracting conflicts of interest, classifying them according to a taxonomy, and identifying associations with conclusions or measures of integrity is incredibly time-consuming (trust me), so if we want to be able to really quantify the difference between articles with and without different types of conflicts of interest, we have to be able to do it using data mining methods (a sketch of what a machine-readable disclosure record might look like follows this list).
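To make that last point concrete, here is one way a machine-readable disclosure record could look. The taxonomy labels, ORCID iD, DOI, and company name are all placeholders; this is a sketch of the idea, not a proposed standard.

```python
import json

# Illustrative disclosure record; every identifier and label is a placeholder
disclosure = {
    "author_orcid": "0000-0000-0000-0000",
    "article_doi": "10.xxxx/example",
    "interests": [
        {
            "type": "research_funding",   # drawn from a shared taxonomy
            "source": "Example Pharma Ltd",
            "period": "2016-2018",
        },
        {
            "type": "speaker_fees",
            "source": "Example Pharma Ltd",
            "period": "2017",
        },
    ],
}
print(json.dumps(disclosure, indent=2))
```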

Together those elements will make it possible to much more precisely measure the uncertainty introduced by the presence of financial conflicts of interest in a cohort of articles. It won’t tell you exactly whether the article you are reading is reliable or not, but it can tell you historically how often articles with the same types of conflicts of interest had conclusions that were substantially different from cohorts of equivalent articles without conflicts of interest.

Going back to the two imaginary articles sitting on your table, now imagine that we have our public registry and the tools you need to precisely label the risk of the articles relative to their designs as well as the authors’ financial conflicts of interest. Instead of wondering which to base your decision on with no guidance, you can quickly determine that the conflicts of interest are of the type that typically places the work firmly in the sometimes category; it would have been incredibly unlikely for the authors to publish a study that didn’t unequivocally trumpet the safety and efficacy of the new expensive thing. Armed with that information, you down-weight the conclusions from that article and base your decision on the more cautious conclusions of the other.

Thinking outside the cylinder: on the use of clinical trial registries in evidence synthesis communities

Clinical trials take a long time to be published, if they are at all. And when they are published, most of them are either missing critical information or have changed the way they describe the outcomes to suit the results they found (or wanted). Despite these problems, the vast majority of the new methods and technologies that we build in biomedical informatics to improve evidence synthesis remain focused on published articles and bibliographic databases.

Simply: no matter how accurately we can convince a machine to screen the abstracts of published articles for us, we are still bound by what doesn’t appear in those articles.

We might assume that published articles are the best source of clinical evidence we have. But there is an obvious alternative. Clinical trial registries are mature systems that have been around for more than a decade, their use is mandated by law and policy for many countries and organisations, and they tell us more completely and more quickly what research should be available. With new requirements for results reporting coming into effect, more and more trials have structured summary results available (last time I checked ClinicalTrials.gov, it was 7.8 million participants from more than 20,000 trials, and that makes up more than 20% of all completed trials).
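As a rough idea of how those structured results can be pulled programmatically, here is a short sketch using Python’s requests library. It assumes ClinicalTrials.gov’s v2 REST endpoint and its JSON field names (hasResults, resultsSection, outcomeMeasuresModule) as I understand them; check the current API documentation before relying on any of this, and note that the NCT number is a placeholder.

```python
import requests

nct_id = "NCT00000000"  # placeholder registration number

# Assumed v2 endpoint for a single study record (verify against the docs)
resp = requests.get(f"https://clinicaltrials.gov/api/v2/studies/{nct_id}", timeout=30)
resp.raise_for_status()
study = resp.json()

if study.get("hasResults"):
    # Assumed location of posted outcome measures in the JSON record
    outcomes = study.get("resultsSection", {}).get("outcomeMeasuresModule", {})
    print("Posted outcome measures:", len(outcomes.get("outcomeMeasures", [])))
else:
    print("No structured results posted yet for", nct_id)
```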

The reality is that not all trials are registered, not all registered trials are linked to their results, and not all results are posted in clinical trial registries. And having some results available in a database doesn’t immediately solve the problems of publication bias, outcome reporting bias, spin, and failure to use results to help design replication studies.

In a systematic review of the processes used to link clinical trial registries to published articles, we found that the proportion of trials that had registrations was about the same as the proportion of registrations that had publications (when they were checked properly, not the incorrect number of 8.7 million patients you might have heard about). Depending on whether you are an optimist or a pessimist, you can say that what is available in clinical trial registries is just as good/bad as what is available in bibliographic databases.

Beyond that, the semi-structured results that are available in ClinicalTrials.gov are growing rapidly (by volume and proportion). The results data (a) help us to avoid some of the biases that appear in published research; (b) can appear earlier; (c) can be used to reconcile published results; and (d) as it turns out, make it much easier to convince a machine to screen for trial registrations that meet a certain set of inclusion criteria.

I suspect that the assumption that clinical trial registries are less useful than the published literature is a big part of the reason why nearly all of the machine learning and other natural language processing research in the area is still stuck on published articles. But that is a bad assumption.

Back in 2012, we wrote in Science Translational Medicine about how to build an open community of researchers to build and grow a repository of structured clinical trial results data. The following is based on that list but with nearly 6 years of hindsight:

  • Make it accessible: And not just in the sense of being open, but by providing tools to make it easier to access results data; tools that support data extraction, searching, screening, synthesis. We already have lots of tools in this space that were developed to work with bibliographic databases, and many could easily be modified to work with results data from clinical trial registries.
  • Make all tools available to everyone: The reason why open source software communities work so well is that by sharing, people build tools on top of other tools on top of methods. It was an idea of sharing that was borne of necessity back when computers were scarce, slow, and people had to make the most of their time with them. Tools for searching, screening, cleaning, extracting, and synthesising should be made available to everyone via simple user interfaces.
  • Let people clean and manage data together: Have a self-correcting mechanism that allows people to update and fix problems with the metadata representing trials and links to articles and systematic reviews, even if the trial is not their own. And share those links, because there’s nothing worse than duplicating data cleaning efforts. If the Wikipedia/Reddit models don’t work, there are plenty of others.
  • Help people allocate limited resources: If we really want to reduce the amount of time it takes to identify unsafe or ineffective interventions, we need to support the methods and tools that help the whole community identify the questions that are most in need of answers, and decide together how best to answer them, rather than competing to chase bad incentives like money and career progression. Methods for spotting questions with answers that may be out of date should become software tools that anyone can use.
  • Make it ethical and transparent: There are situations where data should not be shared, especially when we start thinking about including individual participant data in addition to summary data. There are also risks that people may use the available data to tell stories that are simply not true. Risks related to ethics, privacy, and biases need to be addressed and tools need to be validated carefully to help people avoid mistakes whenever possible.

We are already starting to do some of this work in my team. But there are so many opportunities for people in biomedical informatics to think beyond the bibliographic databases, and to develop new methods that will transform the way we do evidence synthesis. My suggestion: start with the dot points above. Take a risk. Build the things that will diminish the bad incentives and support the good incentives.

Differences in exposure to negative news media are associated with lower levels of HPV vaccine coverage

Over the weekend, our new article in Vaccine was published. It describes how we found links between human papillomavirus (HPV) vaccine coverage in the United States and information exposure measures derived from Twitter data.

Our research demonstrates—for the first time—that locations with Twitter users who saw more negative news media had lower levels of HPV vaccine coverage. What we are talking about here is the informational equivalent of: you are what you eat.

There are two nuanced things that I think make the results especially compelling. First, they show that Twitter data does a better job of explaining differences in coverage than socioeconomic indicators related to how easy it is to access HPV vaccines. Second, the correlations are stronger for initiation (getting your first dose) than for completion (getting your third dose). If we go ahead and assume that information exposure captures something about acceptance, and that socioeconomic differences (insurance, education, poverty, etc.) signal differences in access, then the results really are suggesting that acceptance probably matters, and that we may have a reasonable way to measure it through massive-scale social media surveillance.


Figure: Correlations between models that use Twitter-derived information exposure data and census-derived socioeconomic data to predict state-level HPV vaccine coverage in the United States (elastic net regression, cross-validation; check the paper for more details).
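For readers unfamiliar with the modelling named in the caption above, here is a minimal sketch of elastic net regression with cross-validated predictions in scikit-learn. The random data stand in for per-state topic exposure proportions and coverage estimates; this is not the analysis pipeline from the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.random((50, 31))   # stand-in for per-state exposure proportions, 31 topics
y = rng.random(50)         # stand-in for state-level HPV vaccine coverage

# Elastic net with internal cross-validation over the regularisation path
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000)

# Leave-one-state-out predictions, then correlate predicted with observed
preds = cross_val_predict(model, X, y, cv=LeaveOneOut())
print("correlation between predicted and observed coverage:", pearsonr(preds, y)[0])
```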

It took us quite a long time to get this far. We have been collecting tweets about HPV vaccines since 2013. And not just the (well over) 250K tweets; we also collected information about the users who were following the people who posted those tweets: well over 100 million unique users. We then looked at the user profiles of those people to see if we could work out where they came from. Overall, we located 273.8 million potential exposures to HPV vaccine tweets for 34 million Twitter users in the United States.


Figure:  The volume of information exposure counts for Twitter users identified at the county level in the United States, by percentile, up to 19.7 million in New York County, NY.

Each tweet was assigned to one of 31 topics, and then we determined state-level differences in exposure to those topics. For example, users in Rhode Island were much more often exposed to tweets about legal changes in Rhode Island. Users in New York State had higher proportional exposure to tweets related to a series of articles about the representation of HPV vaccines published in the Toronto Star. States like West Virginia and Arkansas had a greater proportion of exposures to tweets about a controversial story that aired on a television show hosted by Katie Couric. A higher proportion of the exposures in Kentucky, Utah, and Texas related to political and industry rhetoric around vaccine policy in Texas.
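The aggregation step is simple to picture: count each state’s potential exposures by topic, then normalise within the state. The sketch below uses made-up rows and topic names purely to show the shape of the calculation.

```python
import pandas as pd

# One row per potential exposure, with the user's inferred state and the
# tweet's assigned topic (values here are invented for illustration)
exposures = pd.DataFrame({
    "state": ["RI", "RI", "NY", "NY", "KY"],
    "topic": ["legal_changes", "news_story", "news_story", "legal_changes", "policy"],
})

# proportion of each state's exposures falling into each topic
state_topic = pd.crosstab(exposures["state"], exposures["topic"], normalize="index")
print(state_topic.round(2))
```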


Figure: The five topics with the strongest negative correlations with HPV vaccine coverage at the state level in the United States (2015 National Immunization Survey). Mainstream news media features heavily among the topics with the strongest correlations.

The tl;dr—by characterising the information that a sample of Twitter users from each state were exposed to during a two year period, we were able to predict which states would have higher or lower rates of HPV vaccine coverage in the same period of time.

An extra note on computational epidemiology research

We hear about quite a number of studies that use social media data (Facebook, Twitter, Instagram, Reddit, Foursquare, etc.) to try and answer health-related research questions. Despite the volume, only a tiny handful of them have demonstrated a clear link between the kinds of data we can gather from social media and real world health outcomes at the population level.

The one most closely related to what we did recently with these tweets was published in Psychological Science, and predicted heart disease mortality at the county level in the United States based on the language used on Twitter. Eichstaedt and colleagues found that topics related to hostility, aggression, boredom, and fatigue were positively correlated with heart disease mortality, while topics related to skilled occupations, positive experiences, and optimism were negatively correlated with it. They too found that socioeconomic predictors were less powerful than Twitter in explaining differences in mortality.

Other population-level research linking information posted on social media with real world health outcomes tends to fit in the scope of predicting influenza rates by location. I won’t go through all of these here, but these are probably the best represented in the research literature.

There are a number of other studies that use social media to predict health outcomes (or diagnoses) at the individual level, and these tend to consider who users communicate with and the language they use in their posts to predict things like major depressive disorders, alcohol use, suicidal ideation, and other psychological conditions.

So what does it all mean?

First, the research presented in our new article is the computational epidemiology equivalent of basic research.

The research cannot tell us whether news and social media influences the vaccine decision-making of a population, or if it reflects people’s preference for reading information sources that they already agree with. It is probably a bit of both.

Second, the way we might operationalise the methods we have started to develop here is important. It is fine to know which topics appear to resonate with certain communities by doing location inference and demographic inference over user populations in social media [and just wait until you see what we are doing with location inference on Twitter and user modelling for conspiracy beliefs in Reddit]. But there is still a lot of hard work to be done to turn the information into actionable outcomes; informing the design of targeted media interventions and then tracking their effectiveness in situ.

For vaccination, we are still grappling with the evidence about how access and acceptance influence decision-making at the individual and population levels. For social influence, we still haven’t worked out the best ways to measure the contributions of homophily, contagion, and external factors to changing behaviours or opinions. We still don’t know whether social media data can be leveraged to better understand how media interventions influence other kinds of health behaviours. We still don’t know how much actual evidence about safety and efficacy makes its way into the information diets of various populations, and we don’t know whether populations are capable of distinguishing between high quality clinical evidence and the academic equivalent of junk advertising.

What we do know is that if we want to promote cost-effective healthcare, we need to continue to push for more research into the complicated cross-disciplinary world of preventive medicine—that weird mix of evidence-based medicine, epidemiology, social psychology, public health, journalism, and data science.