Hypothesis


The COVID-19 pandemic has not only had an unprecedented effect on community life, but has also significantly impacted the global economy. From retirement investments and small business operations to large revenue impacts on the world’s foremost corporations, the pandemic has created widespread difficulty, further exacerbating global economic obstacles.

Building on the R libraries covered in previous course exercises, this project uses natural language processing (NLP) and sentiment analysis to predict the performance of the United States economy. To achieve this, it examines tweets from Twitter, analyzes their sentiments, and tests whether a connection exists between digital social sentiments and stock market performance.

The following hypothesis and research questions guided this project’s research and technical development:

  • Hypothesis: Twitter sentiments influence stock market performance during a global pandemic.
  • Research Question 1: Are Twitter sentiments able to accurately predict stock market performance during a global pandemic?
  • Research Question 2: Do global Twitter sentiments have a significant impact on the United States’ economic performance?
  • Research Question 3: Do positive and negative sentiments correlate with increases or decreases in market performance?

Method


To limit scope, this project examines only the S&P 500 index, a strong indicator of overall domestic economic performance. The tweets, however, were collected on a global scale. Given the significant impact of global outlooks on the foremost economies, in addition to the global presence of the COVID-19 pandemic, this was a logical scope for data collection and analysis.

Tools: Data collection was a semi-manual effort, interfaced through the Twitter API and the R programming language. All development was conducted in the RStudio integrated development environment (IDE). To support data analysis and visualization outputs, a number of R libraries were used, in addition to some external development resources; these are either mentioned in “Code” or cited in “References.” The R libraries used included NLP, twitteR, syuzhet, tm, RColorBrewer, ROAuth, wordcloud, and ggplot2.
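For readers reproducing this setup, the libraries listed above can be installed once before running the script in “Code” (a one-time setup step, not part of the analysis script itself):

```r
# One-time setup: install the R libraries used throughout this project
install.packages(c("NLP", "twitteR", "syuzhet", "tm", "RColorBrewer",
                   "ROAuth", "wordcloud", "ggplot2"))
```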

Collection: This project consisted of an extensive collection effort conducted over the course of a business week. Prior to official analysis, a test run was completed to ensure the script functioned properly.

This project focuses on two prominent Twitter hashtags, #coronavirus and #COVID19. Given the relatively equal engagement with these two hashtags (an assertion made from social platform trend observations, rather than evidenced fact), it was important to analyze both. This also provided an interesting sub-study: seeing which, if either, had a more significant impact on stock market performance.

For each of the aforementioned hashtags, 1,000 tweets were collected daily, for 5 days, during the April 20, 2020 – April 24, 2020 date range. Thus, a total of 2,000 tweets were collected daily between both hashtags, bringing the 5-day total to 10,000 tweets. The collection was limited to 5 weekdays, reflecting the active days of the United States stock market. Apart from the total number of collected tweets (n), search and collection parameters adhered to the following considerations: search terms were limited to the aforementioned hashtags, tweets were bound by the aforementioned date range, and all tweets had to be in English. Given the study’s emphasis on global impact, no geocode parameters were specified in the search function.
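The daily pulls described above can be sketched as a small wrapper around searchTwitter. The day_window and collect_day helpers below are hypothetical (they are not part of the project script); they bound each search to a single day’s window and tag the resulting rows with their collection date.

```r
library(twitteR)

# Hypothetical helper: since/until bounds for a single day's search window
day_window <- function(day) {
  c(since = as.character(day), until = as.character(day + 1))
}

# Hypothetical helper: pull one day's tweets for a hashtag and tag each
# row with its collection date, mirroring the script's search parameters
collect_day <- function(hashtag, day, n = 1000) {
  w <- day_window(day)
  tweets <- searchTwitter(hashtag, n = n, lang = "en",
                          since = w[["since"]], until = w[["until"]])
  df <- twListToDF(tweets)
  df$collection_date <- day
  df
}

# Run near market close each weekday, e.g.:
# corona_day <- collect_day("#coronavirus", as.Date("2020-04-20"))
```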

Outputs: This project produced original visualizations in the form of word clouds, sentiment bar plots, and sentiment line plots. S&P 500 charts for the specified dates were externally sourced.

Considerations: Given the sensitivity of health-related topics and the vast range of sentiments evident in social posts, raw tweets were retained only for the duration of the collection period. Once all final visualizations were created and confirmed to be consistent with script parameters, all raw tweets were securely deleted, despite being public-facing. Furthermore, the text processing portions of the R script remove any and all user identifiers from collected tweets, such that the analysis functions focus only on the true essence of a given tweet.


Code


To complete the Twitter sentiment analysis for this project, the R language was used with accompanying libraries and the Twitter API. Below, you’ll find the script responsible for natural language processing (NLP), sentiment analysis, and visualization outputs. The libraries used included NLP, twitteR, syuzhet, tm, RColorBrewer, ROAuth, wordcloud, and ggplot2. The script is commented throughout, describing each function’s primary purpose.

      
    # Load the following R libraries
    library(NLP)
    library(twitteR)
    library(syuzhet)
    library(tm)
    library(RColorBrewer)
    library(ROAuth)
    library(wordcloud)
    library(ggplot2)
    
    # Set working directory
    setwd(dir="/Users/XXXX/Downloads")
    
    # Connect to Twitter API
    consumer_key <- "XXXX"
    consumer_secret <- "XXXX"
    access_token <- "XXXX"
    access_secret <- "XXXX"
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
    
    # Conduct Twitter searches for 1,000 tweets per hashtag
    tweets_corona <- searchTwitter("#coronavirus", n = 1000, lang = "en",
                                   since = "2020-04-20", until = "2020-04-24")
    tweets_covid <- searchTwitter("#COVID19", n = 1000, lang = "en",
                                  since = "2020-04-20", until = "2020-04-24")
    
    # Convert tweet list to an actionable data frame
    corona_df <- twListToDF(tweets_corona)
    covid_df <- twListToDF(tweets_covid)
    
    # View data frame to ensure searches were successful
    View(corona_df)
    View(covid_df)
    
    # Save tweet data frame backups (delete after project analysis is complete)
    write.csv(corona_df, "corona.csv")
    write.csv(covid_df, "covid.csv")
    
    # Assign tweet text to variable
    corona_text<- corona_df$text
    covid_text<- covid_df$text
    
    # Create a text corpus for each hashtag's tweet text from above
    coronaCorpus <- Corpus(VectorSource(corona_df$text))
    coronaCorpus <- tm_map(coronaCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
    
    covidCorpus <- Corpus(VectorSource(covid_df$text))
    covidCorpus <- tm_map(covidCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
    
    # Make lowercase
    coronaCorpus <- tm_map(coronaCorpus, content_transformer(tolower))
    covidCorpus <- tm_map(covidCorpus, content_transformer(tolower))
    
    # Delete punctuation, numbers, and whitespace
    coronaCorpus <- tm_map(coronaCorpus, removePunctuation)
    covidCorpus <- tm_map(covidCorpus, removePunctuation)
    coronaCorpus <- tm_map(coronaCorpus, removeNumbers)
    covidCorpus <- tm_map(covidCorpus, removeNumbers)
    coronaCorpus <- tm_map(coronaCorpus, stripWhitespace)
    covidCorpus <- tm_map(covidCorpus, stripWhitespace)
    
    # Partial processing function adapted from CateGitau (GitHub).
    # Each substitution is assigned back to x so the cleanups chain;
    # otherwise only the final gsub result would be returned.
    Textprocessing <- function(x) {
      x <- gsub("http[[:alnum:]]*", '', x)
      x <- gsub('http\\S+\\s*', '', x)  ## Remove URLs
      x <- gsub('#\\S+', '', x)         ## Remove hashtags
      x <- gsub('@\\S+', '', x)         ## Remove mentions
      x <- gsub('[[:cntrl:]]', '', x)   ## Remove control and special characters
      x <- gsub("\\d", '', x)           ## Remove digits
      x
    }
    
    # Apply text processing to each corpus
    coronaCorpus <- tm_map(coronaCorpus, content_transformer(Textprocessing))
    covidCorpus <- tm_map(covidCorpus, content_transformer(Textprocessing))
    
    # Remove stopwords from a predetermined stopwords list
    # (also includes common Twitter terms - i.e. RT, like, etc.)
    tweetStopwords <- readLines("stopwords-big")
    coronaCorpus <- tm_map(coronaCorpus, removeWords, tweetStopwords)
    covidCorpus <- tm_map(covidCorpus, removeWords, tweetStopwords)
    
    # Create word clouds
    wordcloud(coronaCorpus, min.freq = 10, colors = brewer.pal(8, "Dark2"),
              random.color = TRUE, max.words = 1000)
    wordcloud(covidCorpus, min.freq = 10, colors = brewer.pal(8, "Dark2"),
              random.color = TRUE, max.words = 1000)
    
    # Sentiment analysis
    # Retrieve NRC sentiments from tweet text
    mysentiment_corona <- get_nrc_sentiment(corona_text)
    mysentiment_covid <- get_nrc_sentiment(covid_text)
    
    # Sentiment score calculations
    Sentimentscores_corona <- data.frame(colSums(mysentiment_corona))
    Sentimentscores_covid <- data.frame(colSums(mysentiment_covid))
    
    # Name and bind rownames
    names(Sentimentscores_corona)<-"Score"
    Sentimentscores_corona<-cbind("sentiment"=rownames(Sentimentscores_corona),Sentimentscores_corona)
    rownames(Sentimentscores_corona)<-NULL
    
    names(Sentimentscores_covid)<-"Score"
    Sentimentscores_covid<-cbind("sentiment"=rownames(Sentimentscores_covid),Sentimentscores_covid)
    rownames(Sentimentscores_covid)<-NULL
    
    # Create sentiment bar plots and save outputs
    # (each "+" must end a line so ggplot2 chains the layers)
    ggplot(data = Sentimentscores_corona, aes(x = sentiment, y = Score)) +
      geom_bar(aes(fill = sentiment), stat = "identity") +
      theme(legend.position = "none") +
      xlab("Sentiment") + ylab("Score") +
      ggtitle("Sentiments of Tweets Using #coronavirus - (4/XX/2020)")
    ggsave("corona_sentiment.png")
    
    ggplot(data = Sentimentscores_covid, aes(x = sentiment, y = Score)) +
      geom_bar(aes(fill = sentiment), stat = "identity") +
      theme(legend.position = "none") +
      xlab("Sentiment") + ylab("Score") +
      ggtitle("Sentiments of Tweets Using #COVID19 - (4/XX/2020)")
    ggsave("covid_sentiment.png")
    
    # Create sentiment line plots and save outputs
    ggplot(Sentimentscores_corona, aes(sentiment, Score, group = 1)) +
      xlab("Sentiment") + ylab("Score") +
      geom_line(size = 2, color = "#0081e3") +
      ggtitle("Sentiments of Tweets Using #coronavirus - (4/XX/2020)")
    ggsave("corona_line.png")
    
    ggplot(Sentimentscores_covid, aes(sentiment, Score, group = 1)) +
      xlab("Sentiment") + ylab("Score") +
      geom_line(size = 2, color = "#a321ff") +
      ggtitle("Sentiments of Tweets Using #COVID19 - (4/XX/2020)")
    ggsave("covid_line.png")
      
    

Data


Although it would have been beneficial to conduct sentiment analysis on a much larger and publicly available data set, available tweet collections did not meet the hashtag criteria set forth by this project’s requirements. As such, it was necessary to collect tweets through the official Twitter API. Although the total number of collected tweets between the two hashtags reached five figures, Twitter API limitations prevented additional collection under basic API permissions. Furthermore, the limitations on past tweet collection (i.e., not being able to collect beyond a week in the past) required that tweets be collected daily, at a consistent time throughout the collection period. Tweets were collected near market close (~3 PM EST) each day during the specified date range. Utilizing the searchTwitter function outlined in “Code,” the number of collected tweets was set at 1,000 per hashtag. This parameter resulted in 10,000 total tweets between both hashtags over the five days of collection. Collected tweets were limited to those in English.

As detailed in the R script in “Code,” the following visualization plots were created: sentiment bar plots, sentiment line plots, and word clouds. The visualizations and accompanying analyses can be found in the “Findings” section below. The sentiment bar plots aid in understanding the range of emotions beyond the typical ‘positive’ and ‘negative,’ and may therefore provide greater insight into the impacts of particular sentiments on market performance. The word clouds represent a strong, privacy-preserving alternative to raw tweet spreadsheets. Given that the raw tweets were securely deleted, the word clouds highlight the key sentiments and terms present in the majority of the collected tweets.

To provide users with a live look into the COVID-19 pandemic in the United States, the dashboard below was embedded from the Johns Hopkins University COVID-19 website. Their tracking efforts are among the most accurate, with their interactive dashboard enabling precise filtering and daily case count updates.


Findings


Given the variety of ways in which the COVID-19 topic is approached on social media platforms, it was important to consider the two most prominent tweet categorizations, or ‘hashtags.’ Tweets for both #coronavirus and #COVID19 were collected, providing a holistic view into global sentiments, while also enabling an interesting comparison between the two hashtags. The findings below are organized by hashtag, with an overall summary followed by daily analysis. Each day’s findings include a variety of visualizations: a word cloud of recurring terms, sentiment bar and line plots, and an externally sourced S&P 500 market chart.

For each of the five days of collection, #coronavirus sentiment visualizations showed positive sentiments overtaking all other sentiments. Despite this, the market performed well on only two days (days three and five). Days two and three showed near equal positive-to-negative sentiment ratios, indicating potential sentiment conflicts during those times. These conflicting sentiments could have contributed to the weaker market performance on day two and stronger performance on day three. Lastly, for a majority of the days, many of the more specific sentiments were grouped toward the negative end of the sentiment plots, suggesting a potential correlation with the weak market performance on three of the five days.
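The positive-to-negative ratios discussed throughout this section can be read directly off the sentiment score data frames produced in “Code.” The pos_neg_ratio helper below is a sketch (not part of the original script), assuming the Sentimentscores_* layout with “sentiment” and “Score” columns:

```r
# Sketch: positive-to-negative sentiment ratio from an NRC score data frame
pos_neg_ratio <- function(scores) {
  pos <- scores$Score[scores$sentiment == "positive"]
  neg <- scores$Score[scores$sentiment == "negative"]
  pos / neg
}

# A ratio near 1 marks the "near equal" days noted above; values well
# above 1 mark days when positive sentiment dominated the sample.
```

For example, pos_neg_ratio(Sentimentscores_corona) would yield one number per day’s collection, making the day-to-day comparisons above explicit.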

[Figures: #coronavirus word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 20, 2020]

The first day of collection presented several key findings. In the word cloud, the majority of key terms appear largely negative and speculative toward the pandemic. The sentiment distribution and line plot present interesting behaviors in relation to actual market performance. The large increase in positive sentiments can be closely associated with the stronger market performance throughout most of the day. However, the decline in positively associated sentiments is reflected in the sharp drop in market performance near closing. Thus, despite the larger number of positive sentiments among collected tweets, these variations of positively and negatively associated sentiments do seem to have an impact on market performance.

[Figures: #coronavirus word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 21, 2020]

In comparison to the first day of collection, the second day’s word cloud highlights a greater number of positive terms from the tweet sample. Day two’s sentiments show a near equal positive-to-negative ratio, with a fairly even dispersion of positive and negative sentiments. This behavior corresponds with the variable performance of the market, in which a steep drop was followed by a few instances of notable rises.

[Figures: #coronavirus word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 22, 2020]

Day three was met with an uptick in overall market performance. It was the second of two days in which positive and negative sentiments closely mirrored one another. Although its word cloud did not reveal any clusters of overly positive or negative terms, there was a notable dispersion of positive and negative sentiments across the sample. Apart from the twin peaks at the center of the plot, sentiments on both the positive and negative sentiment ends appeared equally dispersed.

[Figures: #coronavirus word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 23, 2020]

Despite the larger presence of positive sentiments among the sample, day four’s market performance was largely variable. The market ended noticeably weaker than days prior, potentially reflecting the greater number of negative sentiments in the word cloud, in addition to the many negatively associated sentiments in the plots. The steep decline following the peak of the plots resembles the significant drop in market performance near midday.

[Figures: #coronavirus word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 24, 2020]

Day five was largely representative of the project’s hypothesis, in which a large presence of positive sentiments coincided with a significant rise in market performance. Interestingly, the word cloud largely consisted of topics related to the pandemic. Unlike previous word clouds, no positive or negative sentiments were especially frequent (a characteristic represented by the physical size of a word in the word cloud).

Like #coronavirus, #COVID19 (the official name of the disease) showed positive sentiments leading throughout the five-day collection period. In other words, positive sentiments exceeded negative sentiments on each of the five days, with no instances of near equal sentiments. Interestingly, the largest positive sentiment leads occurred during strong market performance days (days three and five), indicating a possible correlation. Despite the prominence of positive sentiments, the #COVID19 word clouds largely reflected those from #coronavirus, with a mix of trending terms and varying sentiments.

[Figures: #COVID19 word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 20, 2020]

Despite the notable positive sentiment score, the market remained fairly stable throughout most of the day. However, it closed significantly lower than its opening mark, with a steep decline toward the end of the day. The sentiment line plot exhibits a similar drop-off, given the presence of most sentiments on the negative end of the plot. The word cloud appears to contain a greater number of negatively associated terms and trends, corresponding with the weaker market performance.

[Figures: #COVID19 word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 21, 2020]

The word cloud for day two seemed to present a more neutral cluster of frequent terms among the sample, indicating a lack of frequent positive or negative terms. The sentiment plots were largely representative of day one, corresponding with weak market performance overall. The gradual decline from market open to midday, followed by gradually stronger performance, resembles the behavior exhibited by the sentiment line plot.

[Figures: #COVID19 word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 22, 2020]

Day three was one of two days to demonstrate significant positive-to-negative sentiment ratios, in which the positive sentiments were far greater than previous days. This large increase in positive sentiments corresponded with a far stronger market, one which gradually increased throughout the day. Day three’s word cloud did not contain overly positive or negative sentiment terms, but rather trending terms related to the pandemic. Some terms hinted at a positive outlook among the sample, such as those related to healthcare, social distancing, and treatments.

[Figures: #COVID19 word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 23, 2020]

Day four experienced volatile market performance, with several instances of steep rises and declines. As such, the sentiment plots were evidently scattered, despite the leading positive sentiment score. All additional positive and negative sentiments were fairly equal on either side of the plots, indicating potential conflicts of attitude. These conflicts can potentially be linked to the volatile market performance. The word cloud for day four did not reveal anything surprising or inconsistent with days prior.

[Figures: #COVID19 word cloud, sentiment bar plot, sentiment line plot, and S&P 500 chart - April 24, 2020]

Consistent with patterns uncovered from day three, day five’s positive sentiment score exceeded all other sentiment scores by a notable margin. This significant increase corresponded with strong market performance from midday to market close. Interestingly, the majority of the data points remain on the negative end of the sentiment plot. However, this occurrence was consistent with day three, possibly due to the large margin of difference evident in the positive-to-negative sentiment ratio. The word cloud was consistent with previous days, not revealing anything especially significant about the day’s tweet sentiments.

Plot Utility: Although the plots were not completely representative of market performance, they did exhibit behaviors that correlated with particular market outcomes. As such, they hint at the possibility of pandemic-scoped social sentiments influencing economic outcomes. Thus, with regard to the research questions, particular portions of the plots were representative of market performance, but not exact. There were instances in which sentiments captured the general sense of the market’s outlook for a given day, and sentiment growth patterns exhibited similarities with market performance trends.
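One way to move from this visual comparison to a quantitative check (a sketch only; this computation was not performed in the project) is to correlate a daily net-sentiment score with the day’s S&P 500 return. The five-day vectors below are hypothetical placeholders, not collected results:

```r
# Hypothetical five-day series (placeholders, not project data)
net_sentiment <- c(120, 5, 90, -30, 150)         # positive minus negative score
sp500_return  <- c(-1.8, -3.1, 2.3, -0.1, 1.4)   # daily % change in the index

# Pearson correlation; values near +1 would support the hypothesis,
# though five points are far too few for statistical significance
cor(net_sentiment, sp500_return)
```

With a longer collection period, the same call could be applied to the real daily scores and returns to test Research Question 3 directly.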

Recommendations: Although this project did not demonstrate complete predictive ability, trends among the sentiment plots did hint at likely correlation. For those interested in further researching social sentiments and expanding on this project, several refinements can be made. To begin, expanding the collection period beyond five days would allow for a greater sample, reducing the influence of uncommon outliers. Next, utilizing R to create choropleth heat maps of sentiments would better reveal the sentiment densities of users for a given hashtag; this would provide a practical look into sentiment dispersion, rather than word clouds containing a variety of sentiments in an unorganized manner. Finally, automating the collection and analysis methodologies would increase project efficiency, such that the majority of project time could be dedicated to uncovering analytical insights.

Areas for Future Research: This project has much potential for future research expansion. With technological advances in computational processing and analysis, automating the collection and analysis process would allow potentially ‘hidden’ correlations among the samples to be revealed. Computational analysis can render small and seemingly insignificant details mathematically meaningful and significant. Building on such evidence would allow for the calculation of an accurate predictive model, enabling researchers, investors, and developers to gain invaluable insights into digital social impacts on economic outlooks. Such potential fosters much opportunity for applications of artificial intelligence and machine learning, giving new meaning and utility to social media posts.

Final Takeaways: Although the predictive relationship between tweet sentiments and stock market performance was not as strong as hypothesized, the trends presented a key takeaway: digital social sentiments do exert a certain influence on global outlooks, as evidenced by the vast availability of tagged posts. Such effects shape global economic performance and hint at global attitudes toward progress. Recognizing the value of such sentiments provides an invaluable look into the cogs of globalized function, highlighting the potential of recognition and collaboration in overcoming seemingly insurmountable obstacles.


References


Several references contributed to the development and function of this technological analysis effort. All accredited script functions and project consultation resources are included below.


1 Cookbook for R. (n.d.). Bar and line graphs (ggplot2). Cookbook for R. Retrieved from http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)
2 CRAN R-Project. (2019). Sentiment analysis of Twitter data. CRAN R-Project. Retrieved from https://cran.r-project.org/web/packages/saotd/vignettes/saotd.html
3 Dataturks. (2020). Text processing and sentiment analysis of Twitter data. Hackernoon. Retrieved from https://hackernoon.com/text-processing-and-sentiment-analysis-of-twitter-data-22ff5e51e14c
4 Fernando, S. (2020). Project consultation and guidance. Foundations of Data Science. Retrieved from https://comminfo.rutgers.edu/fernando-suchinthi
5 Gentry, J. (2018). searchTwitter. RDocumentation. Retrieved from https://www.rdocumentation.org/packages/twitteR/versions/1.1.9/topics/searchTwitter
6 Gitau, C. (2017). Twitter text analysis (text processing function). Github. Retrieved from https://gist.github.com/CateGitau/05e6ff80b2a3aaa58236067811cee44e
7 Johns Hopkins University. (2020). Coronavirus Resource Center. Johns Hopkins University. Retrieved from https://coronavirus.jhu.edu/us-map
8 R-bloggers. (2011). Adding lines or points to an existing barplot. R-bloggers. Retrieved from https://www.r-bloggers.com/adding-lines-or-points-to-an-existing-barplot/
9 Rascia, T. (2016). Adding syntax highlighting to code snippets in a blog or website. Tania Rascia. Retrieved from https://www.taniarascia.com/adding-syntax-highlighting-to-code-snippets/
10 RStudio Publications. (n.d.). Text Mining example codes (tweets). RStudio Publications. Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/66739_c4422a1761bd4ee0b0bb8821d7780e12.html
11 Sipra, V. (2020). Twitter sentiment analysis and visualization using R. Towards Data Science – Medium. Retrieved from https://towardsdatascience.com/twitter-sentiment-analysis-and-visualization-using-r-22e1f70f6967
12 Twitter. (2020). Twitter developer docs (Twitter API). Twitter Developer. Retrieved from https://developer.twitter.com/en/docs
13 University of Virginia. (2019). An introduction to analyzing Twitter data with R. University of Virginia Library – Github. Retrieved from https://uvastatlab.github.io/2019/05/03/an-introduction-to-analyzing-twitter-data-with-r/
14 Van den Rul, C. (2019). A guide to mining and analyzing tweets with R. Towards Data Science – Medium. Retrieved from https://towardsdatascience.com/a-guide-to-mining-and-analysing-tweets-with-r-2f56818fdd16
15 Van den Rul, C. (2019). How to Generate Word Clouds in R. Towards Data Science – Medium. Retrieved from https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a
16 Wassner, L. & Farmer, C. (2019). Text mining Twitter data with TidyText in R. Earth Lab. Retrieved from https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/text-mining-twitter-data-intro-r/
17 Yahoo! Finance. (2020). S&P 500 (^GSPC). Yahoo! Finance. Retrieved from https://finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC