Over the weekend, our new article in Vaccine was published. It describes how we found links between human papillomavirus (HPV) vaccine coverage in the United States and information exposure measures derived from Twitter data.
Our research demonstrates—for the first time—that locations with Twitter users who saw more negative news media had lower levels of HPV vaccine coverage. What we are talking about here is the informational equivalent of: you are what you eat.
There are two nuances that I think make the results especially compelling. First, Twitter data does a better job of explaining differences in coverage than socioeconomic indicators related to how easy it is to access HPV vaccines. Second, the correlations are stronger for initiation (getting your first dose) than for completion (getting your third dose). If we assume that information exposure captures something about acceptance, and that socioeconomic differences (insurance, education, poverty, etc.) signal differences in access, then the results suggest that acceptance matters, and that we may have a reasonable way to measure it through massive-scale social media surveillance.
Figure: Correlations between models that use Twitter-derived information exposure data and census-derived socioeconomic data to predict state-level HPV vaccine coverage in the United States (elastic net regression, cross-validation; check the paper for more details).
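For readers curious what the modelling setup looks like in practice, here is a minimal sketch of predicting state-level coverage from per-state features with an elastic net and cross-validation, as the figure caption describes. The data, feature dimensions, and hyperparameter grid below are all illustrative stand-ins, not the paper's actual configuration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n_states, n_topics = 50, 31  # one row per state, one column per topic

# Illustrative stand-ins: per-state topic-exposure features and a
# synthetic coverage outcome with a known linear signal plus noise.
X = rng.random((n_states, n_topics))
coef_true = rng.normal(size=n_topics)
y = X @ coef_true + rng.normal(scale=0.1, size=n_states)

# Elastic net with an internal search over the L1/L2 mixing ratio,
# evaluated with leave-one-state-out predictions.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000)
y_hat = cross_val_predict(model, X, y, cv=LeaveOneOut())

# Correlate held-out predictions with the observed outcome.
r = np.corrcoef(y, y_hat)[0, 1]
print(f"cross-validated correlation: {r:.2f}")
```

The correlation between held-out predictions and observed coverage is the kind of number the figure compares across feature sets (Twitter-derived versus census-derived).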
It took us quite a long time to get this far. We have been collecting tweets about HPV vaccines since 2013. And not just the (well over) 250K tweets: we also collected information about the users who were following the people who posted those tweets, well over 100 million unique users, and examined their profiles to see if we could work out where they were located. Overall, we located 273.8 million potential exposures to HPV vaccine tweets for 34 million Twitter users in the United States.
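The exposure-counting step described above can be sketched in a few lines: each tweet counts as a potential exposure for every follower of its author whom we can place in a US location. The data structures and names below are hypothetical, purely to show the shape of the computation.

```python
from collections import Counter

# Hypothetical inputs: (tweet_id, author) pairs, follower lists per
# author, and an inferred US state per follower (None = not located).
tweets = [("tweet1", "userA"), ("tweet2", "userB"), ("tweet3", "userA")]
followers = {"userA": ["u1", "u2"], "userB": ["u2", "u3"]}
state_of = {"u1": "NY", "u2": "TX", "u3": None}

exposures_by_state = Counter()
exposed_users = set()
for _tweet_id, author in tweets:
    for follower in followers.get(author, []):
        state = state_of.get(follower)
        if state is None:
            continue  # skip followers we could not place in a US state
        exposures_by_state[state] += 1  # one potential exposure
        exposed_users.add(follower)

print(dict(exposures_by_state))  # {'NY': 2, 'TX': 3}
print(len(exposed_users))        # 2 located users exposed
```

At the scale reported in the article (hundreds of millions of exposures), this counting would be distributed rather than done in memory, but the logic is the same.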
Figure: The volume of information exposure counts for Twitter users identified at the county level in the United States, by percentile, up to 19.7 million in New York County, NY.
Each tweet was assigned to one of 31 topics, and we then determined state-level differences in exposure to those topics. For example, users in Rhode Island were much more often exposed to tweets about legal changes in Rhode Island. Users in New York State had higher proportional exposure to tweets related to a series of articles published about the representation of HPV vaccines in the Toronto Star. States like West Virginia and Arkansas had a greater proportion of exposures to tweets about a controversial story that aired on a television show hosted by Katie Couric. A higher proportion of the exposures in Kentucky, Utah, and Texas related to politics and industry rhetoric around vaccine policy in Texas.
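The state-level comparison above rests on converting raw exposure counts into proportions, so that states with very different Twitter populations can be compared on the same scale. A small sketch, with invented counts and topic labels:

```python
# Hypothetical per-state exposure counts for two of the 31 topics.
topic_counts = {
    "RI": {"legal_changes": 400, "couric_story": 100},
    "WV": {"legal_changes": 50, "couric_story": 450},
}

# Normalise each state's counts by its total exposures to get the
# proportional exposure to each topic.
proportions = {
    state: {topic: n / sum(counts.values()) for topic, n in counts.items()}
    for state, counts in topic_counts.items()
}

print(proportions["RI"]["legal_changes"])  # 0.8
print(proportions["WV"]["couric_story"])   # 0.9
```

These per-state proportion vectors are the kind of feature that can then be correlated with coverage or fed into a regression model.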
Figure: The five topics with the strongest negative correlations with HPV vaccine coverage at the state level in the United States (2015 National Immunization Survey). Mainstream news media features heavily among the topics with the strongest correlations.
The tl;dr—by characterising the information that a sample of Twitter users from each state were exposed to during a two year period, we were able to predict which states would have higher or lower rates of HPV vaccine coverage in the same period of time.
An extra note on computational epidemiology research
We hear about quite a number of studies that use social media data (Facebook, Twitter, Instagram, Reddit, Foursquare, etc.) to try to answer health-related research questions. Despite the volume, only a tiny handful have demonstrated a clear link between the kinds of data we can gather from social media and real-world health outcomes at the population level.
The study most closely related to what we did with these tweets was recently published in Psychological Science; it predicted heart disease mortality at the county level in the United States based on language used on Twitter. Eichstaedt and colleagues found that topics related to hostility, aggression, boredom, and fatigue were positively correlated with heart disease mortality, while topics related to skilled occupations, positive experiences, and optimism were negatively correlated with it. They too found that socioeconomic predictors were less powerful in explaining differences in mortality than Twitter-derived language.
Other population-level research linking information posted on social media with real-world health outcomes tends to focus on predicting influenza rates by location. I won’t go through all of these studies here, but they are probably the best represented in the research literature.
There are a number of other studies that use social media to predict health outcomes (or diagnoses) at the individual level, and these tend to consider who users communicate with and the language they use in their posts to predict things like major depressive disorders, alcohol use, suicidal ideation, and other psychological conditions.
So what does it all mean?
First, the research presented in our new article is the computational epidemiology equivalent of basic research.
The research cannot tell us whether news and social media influence the vaccine decision-making of a population, or whether they simply reflect people’s preference for reading information sources they already agree with. It is probably a bit of both.
Second, the way we might operationalise the methods we have started to develop here is important. It is fine to know which topics appear to resonate with certain communities by doing location inference and demographic inference over user populations in social media [and just wait until you see what we are doing with location inference on Twitter and user modelling for conspiracy beliefs in Reddit]. But there is still a lot of hard work to be done to turn that information into actionable outcomes: informing the design of targeted media interventions and then tracking their effectiveness in situ.
For vaccination, we are still grappling with the evidence about how access and acceptance influence decision-making at the individual and population levels. For social influence, we still haven’t worked out the best ways to measure the contributions of homophily, contagion, and external factors to changing behaviours or opinions. We still don’t know whether social media data can be leveraged to better understand how media interventions influence other kinds of health behaviours. We still don’t know how much actual evidence about safety and efficacy makes its way into the information diets of various populations, and we don’t know whether populations are capable of distinguishing between high quality clinical evidence and the academic equivalent of junk advertising.
What we do know is that if we want to promote cost-effective healthcare, we need to continue to push for more research into the complicated cross-disciplinary world of preventive medicine—that weird mix of evidence-based medicine, epidemiology, social psychology, public health, journalism, and data science.