Twitter users with anti-vaccine opinions are relatively easy to spot if we can measure their misinformation exposure

So…I have been systematically collecting tweets about human papillomavirus (HPV) vaccines since October 2013. We now have over two hundred thousand tweets that included keywords related to HPV vaccines, and the first of two pieces of research we have undertaken using these data has just been published in the Journal of Medical Internet Research. It covers 6 months, 83,551 tweets from 30,621 users connected to each other through 957,865 social connections. The study question is a relatively simple one – we wanted to find out about how many people are tweeting “anti-vaccine” opinions about HPV vaccines, the diversity of their concerns, and how the misinformation exposure is distributed throughout the Twitter communities.

What we found was in some ways surprising – around 24% of the tweets about HPV vaccines were classified as “negative” (more on this later). To me, this seems like a very large proportion given that only around 2% of adults are actually refusing vaccinations for their children. In other ways, I’m less surprised because of how many people have so many other unusual beliefs, and the number of surveys that suggest that 20% to 30% of adults believe that vaccines cause autism.

Looking at how people follow each other within the group of 30,621 users, we found that around 29% of everyone who tweeted about HPV vaccines were exposed to a majority of these “negative” tweets because of who they follow.
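A minimal sketch of how an exposure measure like this could be computed, assuming we already have a follow list for each user and a negative/not-negative label for each tweet. This is not the code used in the study; the function name, the data layout, and the toy users below are all made up for illustration.

```python
from collections import defaultdict


def majority_negative_exposure(follows, tweets):
    """Return the set of users for whom more than half of the tweets
    posted by the accounts they follow are labelled negative.

    follows: dict mapping a user to the set of users they follow (assumed layout)
    tweets:  list of (author, is_negative) pairs (assumed layout)
    """
    negative = defaultdict(int)
    total = defaultdict(int)
    for author, is_negative in tweets:
        total[author] += 1
        if is_negative:
            negative[author] += 1

    exposed = set()
    for user, followed in follows.items():
        seen = sum(total[a] for a in followed)
        seen_negative = sum(negative[a] for a in followed)
        # Users who saw nothing at all are not counted as exposed.
        if seen > 0 and seen_negative > seen / 2:
            exposed.add(user)
    return exposed


# Toy data: four hypothetical users and four labelled tweets.
follows = {"ann": {"bob", "cat"}, "bob": {"cat"}, "cat": {"ann"}, "dan": set()}
tweets = [("bob", True), ("bob", True), ("cat", False), ("ann", False)]

exposed_users = majority_negative_exposure(follows, tweets)
share_exposed = len(exposed_users) / len(follows)
```

In this toy network only "ann" follows accounts whose tweets were mostly negative, so the share of exposed users is one in four.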

To classify the tweets as either “negative” or “neutral/positive”, we used supervised machine learning classifiers that were slightly different to the normal kinds of classifiers that just use information about the text to examine the sentiment of a tweet. I’ll be talking about these machine learning classifiers at the MEDINFO conference in Sao Paulo this August.

What we really wanted to know was how many Twitter users were being exposed to this negative kind of information – usually anecdotes about harm, conspiracy theories, complete fabrications, or some strange amalgamation of all of them – whether these users mostly grouped together, and how far their information reached across communities that might be making decisions about HPV vaccines for themselves or their children.

A network of 30,621 Twitter users who posted tweets about HPV vaccines in a six month period. Users in orange were exposed mostly to negative opinions. Circles are users, larger ones have more followers within this group of users. Users more closely connected are generally positioned closer to each other in the picture.

We also wanted to know a bit more about the reach of the actual science and clinical evidence that is being published in the area. As researchers, we know that there are now studies showing that the HPV vaccine is safe and that there is early evidence of effectiveness in the prevention of cervical cancer, but we don’t really know who might be “exposed” to that kind of information.

Perhaps unsurprisingly, the people producing the science of HPV vaccines were located pretty much as far away as they could possibly be from the people exposed mostly to negative opinions. Most of the tweets linked directly to peer-reviewed articles came from the people in the very top left section of the network illustration above.

The main contribution of our study was to determine how much more likely a user who was previously exposed to negative opinions would be to then tweet a negative opinion. The answer was: “a lot more”.
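One simple way to put a number on “how much more likely” is a relative risk: the proportion of exposed users who went on to tweet a negative opinion, divided by the same proportion among users who were not exposed. The sketch below is only an illustration of that ratio and is not the analysis from the paper; the counts are invented.

```python
def relative_risk(exposed_negative, exposed_total, unexposed_negative, unexposed_total):
    """Ratio of the rate of negative tweeting among exposed users to the
    rate among unexposed users."""
    risk_exposed = exposed_negative / exposed_total
    risk_unexposed = unexposed_negative / unexposed_total
    return risk_exposed / risk_unexposed


# Invented counts: 50 of 100 exposed users and 25 of 100 unexposed users
# later tweeted a negative opinion.
example_ratio = relative_risk(50, 100, 25, 100)
```

A ratio of 1 would mean that prior exposure tells us nothing; the further above 1, the stronger the association.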

But to address the reasons why users’ opinions were relatively easy to predict if we know about the information they were exposed to in the past, we have to do a lot more work…

It could be that the opinions were “contagious” and spread through the community. It might also be that people end up forming “homophilous” connections with other users who express the same negative opinions about HPV vaccines. The much more likely explanation is that people who share opinions about all kinds of other things besides HPV vaccines (like guns, religion, politics, conspiracies, organic vegetables, crystals, and magical healing water) are more likely to be connected to each other, and their opinions about HPV vaccines are due to the breadth of misinformation that spreads to them from influential news organisations, celebrities, friends, and magical water practitioners.

It is important that we are careful to explain that the study only demonstrates an association between what people were exposed to in the past and the direction of the opinions they expressed after that. It does not show causation, and it does not tell us how those people came to believe what they do.

The study does tell us something important about how we might be able to estimate risks of poor vaccination decision-making within particular communities in space and time. One of the things we would like to be able to do is to examine where the concentrations of misinformation exposure are distributed geographically in a couple of countries (the US and Australia, because those are the places we know best), as a way of helping public health organisations better understand who might be vaccine anxious (or at risk of becoming vaccine anxious), and the specific concerns they might have. Because remember, only around 2% of adults are conscientiously refusing to vaccinate their children, but an awful lot more of them might be forming their opinions based on the awful misinformation that spreads through the communities they inhabit.

The Health Informatics Conference 2012, Sydney

So #hic12 is nearly here and I’ll be there in a rather unusual capacity. I won’t be giving a talk. I won’t even be standing in front of a poster. I’ll be there as the official twitterer, which means I’ll be flitting around from talk to talk, tweeting from the official @hic_2012 account, and hopefully connecting people in the sort of decentralised organisational process we’ve all come to love about the medium. It’s on from the 30th of July to the 2nd of August and the details are, you know, on the website.

So what’s health informatics all about? Well, at its essence, it’s really about helping doctors, medical practitioners, and clinical researchers do better medicine. Sometimes it’s also about helping patients to help themselves. And pretty much always, it’s about information – spreading it, keeping it private, fitting it together, and using it to improve things.

For all the money thrown around in supporting new technology in healthcare delivery, we don’t seem to have made the sort of progress you might expect for such a critical part of the community – the bit that looks after you when you’re sick. So when you talk to people from outside medicine and healthcare about what actually happens in hospitals and practices around Australia, it’s not a surprise that they’re shocked.

“So the system is paperless, right?” Not even close.

“So the systems aren’t even connected to share information *within* the hospital?” Nope.

“So, I can’t register for an electronic record if my name has a hyphen or an apostrophe?” Haha! no.

It’s hard to believe that this is how things are in healthcare when in the rest of our day-to-day lives we can just download apps on devices to recognise a song/picture we hear/see on the street, connect to people around the world instantaneously, stream live videos of protests to thousands, run away from imaginary zombies to motivate us to stay fit and healthy, and ask Siri to tell us what gets prescribed to patients like us if we visit a doctor. But when it comes to changing technology in the sacred world of medicine there are a few things that get in the way – safety, bureaucracy, the cultural status quo, and profiteering.

And it’s those things that I always want to hear about at conferences on health informatics. Instead of asking what amazing things could be done with the new and ubiquitous technology we have surrounding us, we tend to ask and answer the following:

“How will you make sure that it’s safe?” It will take us many years to evaluate its safety but first we need ethics approval, which will also take way too long.

“How will you know for sure if it is effective and worth the cost?” We will have to test it in the real world, which is in a constant state of flux, so, ummm, actually, we won’t really be able to tell you how effective it is anyway – we’ll guess.

“And it will only cost you a billion dollars!” What?

“How will you convince clinicians to use it?” Oh, there will be resistance. People prefer to maintain the status quo because they work in tightly-constrained worlds with little room to move and adapt. So yeah, there will be resistance.

Meanwhile, there are some impressive people doing some rather amazing things to address the problems, break down the bureaucracy where it isn’t needed, and generally make the kinds of changes to the system that we can be proud of. Quite a few of them will even be at the Health Informatics Conference in Sydney at the end of July.

If you’re going to be there, I’d love to hear from you, find out what your Twitter account is, and add your talk or poster to my tweeting itinerary. If you don’t have a Twitter account and you work in health informatics, I’d like to know why. And most importantly I’d love to ask you how your work addresses or side-steps some of the above problems. I’m looking for disruptive technologies.

The death of Wikipedia? Does it mean anything?

It is now old news that contributions in the form of new articles and edits have slowed dramatically since the end of 2006. The change was so dramatic that it could (should) not be interpreted as a ‘saturation of knowledge’ but is much more likely to come from the more mundane – that contributors had moved on to other forms of social interaction that can consume as many hours as one is willing to give (yes, I’m probably talking about Twitter, even though the massive growth in Twitter started two years later).

So here is what the Wikipedia slow-down looks like, in the number of characters per month (the reason for characters instead of articles/words will become apparent). I have also included the distribution of sizes for Wikipedia article content, by character.

Wikipedia decline in article contributions

Further evidence that the slow-down is related to the amount of time humans are willing to spend on the endeavour comes from the parallel (but not mutually exclusive) slow-down in the number of edits. What this means for Wikipedia is relatively simple:

It is unlikely that Wikipedia will be able to keep up with the vast human knowledge enterprise if it becomes stable or shrinks further.

So the question is: Should we be concerned about the slow down in a repository as important as Wikipedia? And can we learn from it when forecasting about Twitter (and other crowd-sourced streams of knowledge), or perhaps more importantly, the evidence base for biomedical literature as a whole?

So let’s now look at another very large repository of well-linked knowledge, called PubMed. PubMed is a reasonable reflection of modern biomedical knowledge as a whole, but only around 10% of that knowledge is actually fully accessible for free (via PubMed Central, for example). I used an even smaller subset (the Open Access subset of PubMed Central) to estimate the median size of PubMed articles, and used the highest peak you can see in the distribution (because the other peaks are for articles where only the pdf is available and are therefore not a true reflection of the length).
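For anyone wanting to try something similar, here is a rough sketch of one way to pick out the highest peak in a distribution of article sizes: bin the character counts and take the most populated bin. The bin width and the sample counts are invented for illustration, and this is not necessarily how the figure below was produced.

```python
from collections import Counter


def modal_bin(char_counts, bin_width=5000):
    """Return the (lower, upper) character range of the most populated bin."""
    bins = Counter(n // bin_width for n in char_counts)
    top_bin, _ = bins.most_common(1)[0]
    return top_bin * bin_width, (top_bin + 1) * bin_width


# Invented sizes: a few short stubs (as for pdf-only records) and a
# cluster of full-length articles.
sample_sizes = [1200, 1500, 900, 31000, 33000, 34500, 32000, 52000]
peak_range = modal_bin(sample_sizes)
```

With these numbers the full-length cluster wins, so the short stubs do not drag the estimate down.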

PubMed growth and article size distribution

So which of Twitter (a stream of mostly redundant information, with super-linear growth) or Wikipedia (a curated repository with very little redundant information, and slowing growth) is more likely to resemble a fast-forwarded version of biomedical literature? Is biomedical literature a stream or a repository? Does it contain a lot of redundant information or a little? Is it likely to stabilise in the future once we deal with the issue of information overload? Or perhaps, does the production of biomedical literature operate under a completely separate set of principles to other modern streams of knowledge by virtue of its place in history and perceived importance?

These are not rhetorical questions. If you have an idea about this, tweet me at @adamgdunn because I’d love to hear what you think.

Full access to the Twitter API in Matlab via R

[Update: This page is no longer relevant. If you are here to interact with Twitter via the API using Matlab, then you want Twitty. The rest is here for posterity.]

Having slowly degraded my ability to interact with proper operating systems and obscure programming languages (e.g. NSFW), I find it difficult to keep up with “modern” ways of programming. So when it comes to doing something that might be conceivably trivial for serious programmers, I tend to struggle.

One example is accessing the complete Twitter API, including those searches that involve being authorised/authenticated via OAuth. So while I had no problem doing search queries and tweeting via a nice Matlab function, I was unable to find a simple way to find the followers/friends of public users and their retweets/replies using the parts of the API that required authentication/authorisation. So once I did, I thought it would be appropriate to pass along my approach so that it would be available to other fervent Matlab users who may wish to do the same.

And since I have long lost the ability to do anything complicated in programming, this will necessarily be a beginner’s guide to accessing Twitter via MATLAB. I skip over many of the details and specifics but I hope to cover the particular steps that tripped me up along the way.

1. Create an app on Twitter. The most important thing to remember on this step is to leave the callback URL blank. Or delete it afterwards, which is what I did after much too much mucking around trying to work out why I could not access a PIN (more on this later). The other thing you will need to remember is to include Read and Write privileges.

2. If you don’t already have it, download and install R. I’m using version 2.14.0. In my version under Windows, I immediately installed the ROAuth and twitteR packages from within the R application.

3. Run the following commands inside R. Rather than elaborate on these here, it will be better if you peruse the documentation and examples associated with the packages to understand how they work [because I don’t]. Note that I am downloading a cacert file, which will be used later on [I think it is necessary].

  • setwd("D:\\blah\\some-directory\\"); [remember to use \\ for directories]
  • library(twitteR)
  • library(ROAuth)
  • download.file(url="", destfile="cacert.pem") [make sure it ends up in the right place]
  • KEY <- "********************" [consumer key from your twitter app]
  • SECRET <- "***********************************" [consumer secret from your twitter app]
  • cred <- OAuthFactory$new(consumerKey = KEY,
  •     consumerSecret = SECRET,
  •     requestURL = "",
  •     accessURL = "",
  •     authURL = "")
  • cred$handshake(cainfo="cacert.pem")

At this point you will be presented with a statement containing a URL to go and find a PIN from Twitter. If you have set up your application correctly (no callback URL), you can simply navigate to that website, copy the PIN, and paste it into R.

4. Inside R again, save the OAuth object (cred) to a suitable filename. I have saved mine as Cred.RData. The command is “save”, for example save(cred, file="Cred.RData"). It’s easy to find.

5. Create a new R script with the following commands:

  • setwd("D:\\blah\\some-directory\\")
  • args <- commandArgs(TRUE)
  • library(twitteR)
  • library(ROAuth)
  • load("D:\\blah\\some-directory\\Cred.RData")
  • cred$isVerified()
  • print(cred$OAuthRequest(args[1], "GET", ssl.verifypeer = FALSE))

You will notice that this script includes “args[1]”, which we are going to pass from within MATLAB when we call the script. You will also notice that we have loaded the old OAuth object, which means that you will no longer need to enter the PIN each time you want to access the Twitter API.

Those of you who are following carefully will also notice that this is a particularly unsafe way of requesting information from Twitter: setting ssl.verifypeer = FALSE turns off certificate verification, which leaves the request prone to man-in-the-middle attacks. I can’t imagine how the resulting strings could be dangerous, and there is no private information contained in what is being sent around, so I am comfortable with this until I am convinced otherwise.

6. In MATLAB, simply create the html you will use to call the Twitter API, for example, noting the method for producing the correct quotation marks (without it, the command line will not know how to interpret your ampersands correctly):

htmlx = '""';

In this case, the html request will collect up a certain Australian politician’s last two tweets, which always make for interesting reading.

7. Then ask R to run the script you have written. To do this, you need only run the following command from within MATLAB (or within a function in MATLAB, of course):

  • [status,result] = system(['D:\Programs\R-2.14.0\bin\RScript --vanilla --no-save --slave query_script.R ' htmlx]);

This will return a bunch of junk that you won’t need to use, as well as the tweets, followers, friends, or whatever else you have requested in your properly-formed htmlx variable, which is passed as an argument to R and parsed there.

8. I won’t go into the details of how you can then strip the results of this call to produce what you might be looking for because this depends on the specific API calls you are making. As for me, I have created a miniature library that implements specific API calls, and then devours the results using simple regular expressions to produce structures for the returned tweets and users.
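For readers who don’t live in MATLAB, steps 6 to 8 can be sketched in Python as follows. This is only an illustration of the two ideas (wrap the URL in double quotes so the command line leaves the ampersands alone, then pull fields out of the returned text with a regular expression). The URL, the "text" field name, and the sample output are assumptions made up for the example, and the real output printed by R may need extra unescaping before a pattern like this will match.

```python
import re


def build_rscript_command(rscript_path, script_name, url):
    """Build the command line, wrapping the URL in double quotes so the
    shell does not treat '&' as a command separator."""
    return f'{rscript_path} --vanilla --no-save --slave {script_name} "{url}"'


# Matches "text":"..." and allows backslash-escaped characters inside the value.
TEXT_FIELD = re.compile(r'"text"\s*:\s*"((?:[^"\\]|\\.)*)"')


def extract_texts(raw_output):
    """Return the value of every "text" field found in the raw output,
    ignoring whatever junk surrounds it."""
    return [match.group(1) for match in TEXT_FIELD.finditer(raw_output)]


# Hypothetical URL and output, for illustration only.
command = build_rscript_command(
    "Rscript", "query_script.R",
    "https://api.example.com/timeline?user=someone&count=2",
)
sample_output = (
    "Loading required package: RCurl\n"
    '[{"text":"first tweet","id":1},{"text":"second tweet","id":2}]'
)
tweet_texts = extract_texts(sample_output)
```

Regular expressions are a blunt tool for this sort of thing; if the response is clean JSON, a proper JSON parser is the more robust choice.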

I am unlikely to make the rest of the code for this public in the near future and I don’t plan to answer questions about this [because there are experts who will be able to do a much better job than I can] but if you decide you really need to contact me, then it is not terribly difficult to find my email address, or you can tweet me at @adamgdunn.

Language communities – another use for Twitter

This visualisation by Eric Fischer is wonderful in its simplicity and a quick zoom around the (very big) original-sized version reveals some amazing information about the languages being spoken in different places around the world. Of course Europe is very interesting with the clear borders and the unusually high density of tweets in the Netherlands – but there are also some very interesting observations to be made about diversity around the world.

Europe language twitter visualisation

“Sipping from the fire hose” – sampling Twitter streams

A quick link for today: a visualisation of Twitter that offers a more practical explanation of a nation’s mood (presumably the UK, in the picture below). I think it is interesting and beautiful.