Using Twitter to Predict COVID-19 in New Jersey



TL;DR: This project’s main objective was to use Twitter sentiment analysis to predict the prevalence of COVID-19 cases in each of New Jersey’s 21 counties. Although the results did not statistically support a predictive model, the approach has implications for how we identify outbreaks and hints at the communal power of individual sentiments.


Introduction & Background


This project stemmed from prior course experimentation with Twitter sentiment analysis. Given the available computational power, the many libraries and APIs that support this methodology, and the prominence of COVID-19 around the globe, COVID-19 tweets were chosen as the subject of analysis. To limit the scope of the project and determine whether sentiment-based predictions of COVID-19 cases had any practical viability, the analysis was restricted to the State of New Jersey and its 21 counties.


Hypothesis & Research Question


  • Research Question: Can positive and negative Twitter sentiments influence the number of COVID-19 cases in New Jersey, at the county level?
  • H0 (Null Hypothesis): Twitter sentiments do not influence the number of COVID-19 cases in New Jersey counties.
  • H1 (Alternative Hypothesis): Twitter sentiments do influence the number of COVID-19 cases in New Jersey counties.
    • In other words, counties with a greater negative-to-positive sentiment ratio will have more COVID-19 cases than counties with a greater positive-to-negative sentiment ratio.
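
The comparison in H1 rests on a per-county ratio of negative to positive tweets. The arithmetic can be sketched as follows. This is a minimal illustration in Python rather than the project’s R code; the function name and the example counts are hypothetical and are not data from the project.

```python
def negative_to_positive_ratio(negative: int, positive: int) -> float:
    """Return the negative-to-positive sentiment ratio for one county."""
    if negative < 0 or positive < 0:
        raise ValueError("tweet counts must be non-negative")
    if positive == 0:
        # The ratio is undefined without positive tweets; report infinity
        # when negatives exist and NaN when there are no tweets of either kind.
        return float("inf") if negative > 0 else float("nan")
    return negative / positive


# Hypothetical counts for illustration only (not data from the project).
example_counts = {
    "County A": {"negative": 90, "positive": 60},
    "County B": {"negative": 50, "positive": 100},
}

ratios = {
    name: negative_to_positive_ratio(c["negative"], c["positive"])
    for name, c in example_counts.items()
}
```

Under H1, the hypothetical County A (ratio above 1) would be expected to report more cases than County B (ratio below 1).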

Method


  • Collection Process: A Twitter Developer API access token was obtained and used from the R language to collect tweets, analyze their sentiments, and construct choropleth maps.
    • Environment: R scripts were written in the RStudio integrated development environment (IDE), in which the following libraries were imported: NLP, twitteR, syuzhet, tm, ROAuth, and ggplot2.
    • Search Parameters: Tweets were collected based on the presence of the “coronavirus” search term, which was chosen over “COVID-19” because more tweets were available for it. Collection was limited to English-language tweets posted from April 12 to April 18, 2020, with 150 tweets collected for each county. The search was run 21 times, once for each of New Jersey’s 21 counties. To separate the tweets by county, the geocode (latitude and longitude) of each county was used as an additional search argument (see “Code” for structure). To set an appropriate collection radius, the area of each county was recorded and used to calculate the radius supplied as the final search parameter of the searchTwitter function.
  • Data Cleaning: Prior to sentiment analysis, the raw tweets went through several rounds of text cleaning, including the removal of all URLs, punctuation, usernames, and stop words. This was so that the sentiment analysis could focus solely on the true essence of a given tweet, increasing the likelihood of identifying either a positive or negative sentiment.
  • Omissions: Although the collected tweets were public, several steps were taken to protect user privacy. Geolocation and time series information was retained, but all usernames of tweet authors and of users mentioned in replies were removed. Furthermore, the spreadsheets containing the raw tweets have not been posted to this website or to any other online resource. Instead, these spreadsheets were securely deleted after the sentiment analysis was completed.
  • Obstacles: The number of tweets requested, or ‘n’ in the searchTwitter function, was limited to 150 because too few tweets were available in particular geolocations. To keep the sample size consistent across counties, 150 was the largest n-value for which the searchTwitter function could still collect tweets. The shortage may have several causes, including a lack of active Twitter users in a county, Twitter users in a county not enabling tweet geocoding, sparse populations in particular counties, and a general ‘digital social’ disinterest in the subject matter.
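
Two of the steps above lend themselves to a short sketch: turning a county’s area into a search radius, and cleaning a raw tweet. The version below is in Python for illustration and is not the project’s R code. It assumes the radius is that of a circle with the same area as the county (the source does not state the exact formula), formats the result in the “latitude,longitude,radius” style that searchTwitter’s geocode argument accepts, and uses a deliberately tiny stop-word list. All function names, coordinates, and the sample tweet are hypothetical.

```python
import math
import re
import string

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def radius_from_area(area_sq_mi: float) -> float:
    """Radius (miles) of a circle whose area equals the county's area.

    Assumption: the county is approximated as a circle.
    """
    return math.sqrt(area_sq_mi / math.pi)


def geocode_argument(lat: float, lon: float, area_sq_mi: float) -> str:
    """Build a 'lat,lon,radius' string with the radius in miles."""
    return f"{lat},{lon},{radius_from_area(area_sq_mi):.1f}mi"


def clean_tweet(text: str) -> str:
    """Remove URLs, usernames, punctuation, and stop words from a tweet."""
    # URLs and usernames go first, while their punctuation is still intact.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    # Strip ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercasing is an added assumption so stop words match regardless of case.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)


# Hypothetical inputs for illustration only.
sample_geocode = geocode_argument(40.0, -74.5, math.pi * 100)
sample_clean = clean_tweet("@user The coronavirus is spreading in NJ! https://t.co/abc")
```

Removing URLs and usernames before punctuation matters: once “@” and “://” are stripped, those tokens can no longer be recognized.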

Areas for Future Research


Recognizing how social sentiments relate to community outbreaks has the potential to greatly change the ways in which we go about outbreak case recognition. Although its hypothesis was not statistically supported, this project has several routes of future expansion, of which some of the most prominent include:

  • Platform Differentiation: Extending sentiment analysis to other social media platforms would make it easier to judge how accurately digital social sentiments track outbreak case projections. Including ‘atypical social media platforms,’ such as forums, blogs, and town websites, through web scrapers and search engine-like crawlers, has the potential to uncover even more sentiment correlations among communities. This could prove fundamental in identifying growth trends and patterns and, potentially, patient zeros.
  • Social Network Maps: Identifying the spread of social sentiments across a community will not only reveal the source of particular sentiments, but also determine the influence of communities within a particular geographic region. Such mapping could allow investigators and researchers alike to recognize the very beginnings of an outbreak or significant event, proving invaluable to analytical and investigative efforts.
  • Artificial Intelligence, Automation, & Machine Learning: Automating the sentiment analysis and tweet collection process, while also incorporating advanced computational ‘thinking,’ has the potential to increase predictive and correlative abilities beyond the individual researcher level. Such innovations could handle a far greater influx of social sentiment and geographic information, enabling researchers to expand this project’s search efforts to a global scale.

Takeaways


Although this project does not statistically support a model in which tweet sentiments predict case outbreaks in New Jersey counties, the methodology itself has much potential. Inconsistencies in tweet availability may have influenced the accuracy of this model, but they do reveal interesting patterns. As evidenced by the choropleth maps, sentiments, while not playing a direct role in case prediction, do characterize attitude similarities towards shared community experiences. Thus, perhaps the most revealing finding of this project was that the ways in which we react to events and interact with each other have the potential to shape our communal attitudes towards shared obstacles. Perhaps with the right communal attitude, even the most insurmountable obstacles can be overcome. I’ll leave that hypothesis for another study.