Hypothesis
The COVID-19 pandemic has not only had an unprecedented effect on community life, but has also significantly impacted the global economy. From retirement investments and small business operations to revenue losses at the world’s foremost corporations, the pandemic has created widespread hardship that further compounds existing global economic challenges.
Building on the R libraries covered in previous course exercises, this project uses natural language processing (NLP) and sentiment analysis to predict the performance of the United States economy. To achieve this, the project examines tweets collected from Twitter, analyzes their sentiments, and tests whether a connection exists between digital social sentiments and stock market performance.
The following hypothesis and research questions guided this project’s research and technical development:
- Hypothesis: Twitter sentiments influence stock market performance during a global pandemic.
- Research Question 1: Are Twitter sentiments able to accurately predict stock market performance during a global pandemic?
- Research Question 2: Do global Twitter sentiments have a significant impact on the United States’ economic performance?
- Research Question 3: Do positive and negative sentiments correlate with increases or decreases in market performance?
Method
To limit scope, this project examines only the S&P 500 index, a strong indicator of overall domestic economic performance. The tweets, however, were collected on a global scale. Given the significant influence of global outlooks on the world’s foremost economies, and the global reach of the COVID-19 pandemic itself, worldwide collection was the logical choice for data collection and analysis.
Tools: Data collection was a semi-manual effort, interfaced through the Twitter API and the R programming language. All development was conducted in the RStudio integrated development environment (IDE). A number of R libraries, along with some external development resources, supported the data analysis and visualization outputs; these resources are either mentioned in “Code” or cited in “References.” The R libraries used were NLP, twitteR, syuzhet, tm, RColorBrewer, ROAuth, wordcloud, and ggplot2.
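For anyone reproducing this environment, the libraries above can be installed once from CRAN before loading them (a minimal setup sketch; at the time of this project, all eight packages were available on CRAN):
# One-time installation of the packages used in this project
install.packages(c("NLP", "twitteR", "syuzhet", "tm",
                   "RColorBrewer", "ROAuth", "wordcloud", "ggplot2"))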
Collection: This project consisted of an extensive collection effort conducted over the course of a business week. Prior to official analysis, a test run was completed to ensure the script functioned properly.
This project focuses on two prominent Twitter hashtags, #coronavirus and #COVID19. Given their roughly equal engagement (an assertion based on observed platform trends rather than measured data), it was important to analyze both. This also provided an interesting sub-study: determining which, if either, had a more significant impact on stock market performance.
For each of these hashtags, 1,000 tweets were collected daily for 5 days, spanning April 20, 2020 through April 24, 2020. Thus, a total of 2,000 tweets were collected daily across both hashtags, bringing the 5-day total to 10,000 tweets. The collection was limited to 5 weekdays, reflecting the active days of the United States stock market. Apart from the total number of collected tweets (n), the search and collection parameters were as follows: search terms were limited to the two hashtags, tweets were bound by the date range above, and all tweets had to be in English. Given the study’s emphasis on global impact, no geocode parameters were specified in the search function.
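Because tweets were gathered each weekday rather than in a single pull, the daily step can be sketched as a small loop around the searchTwitter function used in “Code” (a simplified sketch; the dated file names are illustrative, not part of the original script):
# Sketch: collect 1,000 English tweets per hashtag for the current day,
# then save a dated backup (run once per weekday near market close)
day <- Sys.Date()
for (tag in c("#coronavirus", "#COVID19")) {
  tweets <- searchTwitter(tag, n = 1000, lang = "en",
                          since = as.character(day),
                          until = as.character(day + 1))
  df <- twListToDF(tweets)
  write.csv(df, paste0(gsub("#", "", tag), "_", day, ".csv"))
}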
Outputs: This project produced original visualizations in the form of word clouds, sentiment bar plots, and sentiment line plots. S&P 500 charts for the specified dates were externally sourced.
Considerations: Given the sensitivity of health-related topics and the vast range of sentiments evident in social posts, raw tweets were retained only for the duration of the collection period. Once all final visualizations were created and confirmed to be consistent with the script parameters, all raw tweets were securely deleted, despite being public facing. Furthermore, the text processing portions of the R script remove any and all user identifiers from collected tweets, so that the analysis functions focus only on the substance of a given tweet.
Code
To complete the Twitter sentiment analysis for this project, the R language was used with accompanying libraries and the Twitter API. Below, you’ll find the script responsible for natural language processing (NLP), sentiment analysis, and visualization outputs. The libraries used included NLP, twitteR, syuzhet, tm, RColorBrewer, ROAuth, wordcloud, and ggplot2. The script is commented throughout, describing each function’s primary purpose.
# Load the following R libraries
library(NLP)
library(twitteR)
library(syuzhet)
library(tm)
library(RColorBrewer)
library(ROAuth)
library(wordcloud)
library(ggplot2)
# Set working directory
setwd(dir="/Users/XXXX/Downloads")
# Connect to Twitter API
consumer_key <- "XXXX"
consumer_secret <- "XXXX"
access_token <- "XXXX"
access_secret <- "XXXX"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
# Conduct Twitter searches for 1,000 tweets per hashtag
tweets_corona <- searchTwitter("#coronavirus", n = 1000, lang = "en",
                               since = "2020-04-20", until = "2020-04-24")
tweets_covid <- searchTwitter("#COVID19", n = 1000, lang = "en",
                              since = "2020-04-20", until = "2020-04-24")
# Convert tweet list to an actionable data frame
corona_df <- twListToDF(tweets_corona)
covid_df <- twListToDF(tweets_covid)
# View data frame to ensure searches were successful
View(corona_df)
View(covid_df)
# Save Tweet data frame backups (delete after project analysis is complete)
write.csv(corona_df, "corona.csv")
write.csv(covid_df, "covid.csv")
# Assign tweet text to variable
corona_text <- corona_df$text
covid_text <- covid_df$text
# Create a text corpus for each hashtag's tweet text from above
coronaCorpus <- Corpus(VectorSource(corona_df$text))
coronaCorpus <- tm_map(coronaCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
covidCorpus <- Corpus(VectorSource(covid_df$text))
covidCorpus <- tm_map(covidCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# Make lowercase
coronaCorpus <- tm_map(coronaCorpus, content_transformer(tolower))
covidCorpus <- tm_map(covidCorpus, content_transformer(tolower))
# Delete punctuation, numbers, and whitespace
coronaCorpus <- tm_map(coronaCorpus, removePunctuation)
covidCorpus <- tm_map(covidCorpus, removePunctuation)
coronaCorpus <- tm_map(coronaCorpus, removeNumbers)
covidCorpus <- tm_map(covidCorpus, removeNumbers)
coronaCorpus <- tm_map(coronaCorpus, stripWhitespace)
covidCorpus <- tm_map(covidCorpus, stripWhitespace)
# Processing function adapted from CateGitau (GitHub), with each
# substitution chained so that every rule is actually applied
# (the original discarded all but the last gsub result)
Textprocessing <- function(x) {
  x <- gsub("http[[:alnum:]]*", "", x)  # Remove URLs
  x <- gsub("http\\S+\\s*", "", x)      # Remove remaining URLs
  x <- gsub("#\\S+", "", x)             # Remove hashtags
  x <- gsub("@\\S+", "", x)             # Remove mentions (user identifiers)
  x <- gsub("[[:cntrl:]]", "", x)       # Remove control/special characters
  x <- gsub("\\d", "", x)               # Remove digits
  x
}
# Apply text processing to each corpus
coronaCorpus <- tm_map(coronaCorpus, content_transformer(Textprocessing))
covidCorpus <- tm_map(covidCorpus, content_transformer(Textprocessing))
# Remove stopwords from a predetermined stopwords list
# (also includes common Twitter terms - i.e. RT, like, etc.)
tweetStopwords <- readLines("stopwords-big")
coronaCorpus <- tm_map(coronaCorpus, removeWords, tweetStopwords)
covidCorpus <- tm_map(covidCorpus, removeWords, tweetStopwords)
# Create wordcloud
coronaCloud <- wordcloud(coronaCorpus, min.freq = 10, colors = brewer.pal(8, "Dark2"),
                         random.color = TRUE, max.words = 1000)
covidCloud <- wordcloud(covidCorpus, min.freq = 10, colors = brewer.pal(8, "Dark2"),
                        random.color = TRUE, max.words = 1000)
# Sentiment analysis
# Retrieving sentiments from tweet text
mysentiment_corona <- get_nrc_sentiment(corona_text)
mysentiment_covid <- get_nrc_sentiment(covid_text)
# Sentiment score calculations
Sentimentscores_corona <- data.frame(colSums(mysentiment_corona))
Sentimentscores_covid <- data.frame(colSums(mysentiment_covid))
# Name and bind rownames
names(Sentimentscores_corona) <- "Score"
Sentimentscores_corona <- cbind("sentiment" = rownames(Sentimentscores_corona), Sentimentscores_corona)
rownames(Sentimentscores_corona) <- NULL
names(Sentimentscores_covid) <- "Score"
Sentimentscores_covid <- cbind("sentiment" = rownames(Sentimentscores_covid), Sentimentscores_covid)
rownames(Sentimentscores_covid) <- NULL
# Create sentiment bar plots and save outputs
ggplot(data = Sentimentscores_corona, aes(x = sentiment, y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Score") +
  ggtitle("Sentiments of Tweets Using #coronavirus - (4/XX/2020)")
ggsave("corona_sentiment.png")
ggplot(data = Sentimentscores_covid, aes(x = sentiment, y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Score") +
  ggtitle("Sentiments of Tweets Using #COVID19 - (4/XX/2020)")
ggsave("covid_sentiment.png")
# Create sentiment line plots and save outputs
ggplot(Sentimentscores_corona, aes(sentiment, Score, group = 1)) +
  xlab("Sentiment") + ylab("Score") +
  geom_line(size = 2, color = "#0081e3") +
  ggtitle("Sentiments of Tweets Using #coronavirus - (4/XX/2020)")
ggsave("corona_line.png")
ggplot(Sentimentscores_covid, aes(sentiment, Score, group = 1)) +
  xlab("Sentiment") + ylab("Score") +
  geom_line(size = 2, color = "#a321ff") +
  ggtitle("Sentiments of Tweets Using #COVID19 - (4/XX/2020)")
ggsave("covid_line.png")
Data
Although it would have been beneficial to conduct sentiment analysis on a much larger, publicly available data set, the available tweet collections did not meet this project’s hashtag criteria. As such, it was necessary to collect tweets through the official Twitter API. Although the total number of collected tweets across the two hashtags reached five figures, Twitter API rate limits prevented additional collection under basic API permissions. Furthermore, limits on historical collection (i.e., tweets older than roughly one week cannot be retrieved) required that tweets be collected daily, at a consistent time throughout the collection period. Tweets were collected near market close (~3 PM EST) each day during the specified date range. Utilizing the searchTwitter function outlined in “Code,” the number of collected tweets was set at 1,000 per hashtag per day, resulting in 10,000 total tweets across both hashtags over the five days of collection. Collected tweets were limited to those in English.
As detailed in the R script in “Code,” the following visualization plots were created: sentiment bar plots, sentiment line plots, and word clouds. The visualizations and accompanying analyses can be found in the “Findings” section below. The sentiment bar plots aid in understanding the range of emotions beyond the typical ‘positive’ and ‘negative,’ and so may provide greater insight into the impacts of particular sentiments on market performance. The word clouds represent a strong, privacy-preserving alternative to raw tweet spreadsheets: given that the raw tweets were securely deleted, the word clouds highlight the key sentiments and terms present in a majority of the collected tweets.
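For context on what the bar plots summarize, syuzhet’s get_nrc_sentiment function scores each tweet against the NRC emotion lexicon and returns one row per tweet with ten columns, eight emotions plus two polarities (an illustrative call; the example sentence is invented):
# Each row scores one document against the NRC lexicon
library(syuzhet)
get_nrc_sentiment("markets rally as recovery hopes grow")
# Returned columns: anger, anticipation, disgust, fear, joy, sadness,
# surprise, trust, negative, positive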
To provide users with a live look into the COVID-19 pandemic in the United States, the dashboard below was embedded from the Johns Hopkins University COVID-19 website. Their tracking efforts are among the most accurate, and their interactive dashboard enables precise filtering and daily case-count updates.
Findings
Given the variety of ways the COVID-19 topic is approached on social media, it was important to consider the two most prominent tweet categorizations, or ‘hashtags.’ Tweets for both #coronavirus and #COVID19 were collected, providing a holistic view of global sentiments while also enabling a comparison between the two hashtags. Click through each of the tabs below to reveal a hashtag’s overall findings, in addition to daily analysis. Under each day’s findings, you’ll have access to a variety of visualizations, including a word cloud of recurring terms, sentiment bar and line plots, and an externally sourced S&P 500 market chart. Click to enlarge any visualization for a more detailed look.
Plot Utility: Although the plots were not fully representative of market performance, they did exhibit behaviors that correlated with particular market outcomes. As such, they hint at the possibility of pandemic-scoped social sentiments influencing economic outcomes. With regard to the research questions, portions of the plots tracked market performance, though not exactly: on some days the sentiments captured the general sense of the market’s outlook, and sentiment growth patterns showed similarities to market performance trends.
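One way to move beyond visual comparison is to correlate daily net sentiment with same-day S&P 500 returns (a minimal sketch; the daily_summary data frame and its columns are hypothetical stand-ins for values compiled from the daily runs and the externally sourced market charts):
# Hypothetical daily summary: net sentiment per day vs. S&P 500 return (%)
# (fill in the NA placeholders with the compiled values before running)
daily_summary <- data.frame(
  date          = as.Date("2020-04-20") + 0:4,
  net_sentiment = rep(NA_real_, 5),  # positive minus negative score per day
  sp500_return  = rep(NA_real_, 5)   # % change at close, externally sourced
)
# Pearson correlation between sentiment and same-day market return
cor(daily_summary$net_sentiment, daily_summary$sp500_return,
    use = "complete.obs")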
Recommendations: Although this project did not demonstrate full predictive ability, trends among the sentiment plots hinted at likely correlation. For those interested in further researching social sentiments and expanding on this project, several refinements can be made. To begin, expanding the collection period beyond five days would yield a larger sample and reduce the influence of uncommon outliers. Next, using R to create choropleth heat maps of sentiment (sketched below) would better reveal the sentiment densities of users for a given hashtag, offering a practical look at sentiment dispersion rather than word clouds that mix sentiments in an unorganized manner. Finally, automating the collection and analysis methodologies would increase project efficiency, so that the majority of project time could be devoted to uncovering analytical insights.
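As a starting point for the choropleth suggestion above, the sketch below assumes a hypothetical state_sentiment data frame holding a lowercase state name (region) and a mean sentiment score per state; producing it would require geocoded tweets, which this project deliberately did not collect:
# Choropleth sketch of mean sentiment by US state
# (state_sentiment is hypothetical: columns `region` and `mean_score`)
library(ggplot2)
us_states <- map_data("state")  # requires the 'maps' package
plot_df <- merge(us_states, state_sentiment, by = "region")
plot_df <- plot_df[order(plot_df$order), ]  # keep polygon point order intact
ggplot(plot_df, aes(x = long, y = lat, group = group, fill = mean_score)) +
  geom_polygon(color = "white", size = 0.1) +
  coord_quickmap() +
  labs(fill = "Mean sentiment") +
  theme_void()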
Areas for Future Research: This project has much potential for future expansion. With advances in computational processing and analysis, automating the collection and analysis pipeline could reveal potentially ‘hidden’ correlations among the samples, as computational analysis can render small and seemingly insignificant details mathematically meaningful. Building on such evidence would allow for the construction of an accurate predictive model, enabling researchers, investors, and developers to gain invaluable insight into digital social impacts on economic outlooks. Such potential invites applications of artificial intelligence and machine learning, giving new meaning and utility to social media posts.
Final Takeaways: Although the predictive relationship between tweet sentiments and stock market performance was not as strong as hypothesized, the trends presented a key takeaway - digital social sentiments do exert a measure of influence on global outlooks, as evidenced by the vast volume of tagged posts. Such effects shape global economic performance and hint at global attitudes toward progress. Recognizing the value of these sentiments thus provides an invaluable look into the workings of a globalized economy, highlighting the potential of recognition and collaboration in overcoming seemingly insurmountable obstacles.
References
Several references contributed to the development and function of this technological analysis effort. All accredited script functions and project consultation resources are included below.
# | Source
---|---
1 | Cookbook for R. (n.d.). Bar and line graphs (ggplot2). Cookbook for R. Retrieved from http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)
2 | CRAN R-Project. (2019). Sentiment analysis of Twitter data. CRAN R-Project. Retrieved from https://cran.r-project.org/web/packages/saotd/vignettes/saotd.html
3 | Dataturks. (2020). Text processing and sentiment analysis of Twitter data. Hackernoon. Retrieved from https://hackernoon.com/text-processing-and-sentiment-analysis-of-twitter-data-22ff5e51e14c
4 | Fernando, S. (2020). Project consultation and guidance. Foundations of Data Science. Retrieved from https://comminfo.rutgers.edu/fernando-suchinthi
5 | Gentry, J. (2018). searchTwitter. RDocumentation. Retrieved from https://www.rdocumentation.org/packages/twitteR/versions/1.1.9/topics/searchTwitter
6 | Gitau, C. (2017). Twitter text analysis (text processing function). GitHub. Retrieved from https://gist.github.com/CateGitau/05e6ff80b2a3aaa58236067811cee44e
7 | Johns Hopkins University. (2020). Coronavirus Resource Center. Johns Hopkins University. Retrieved from https://coronavirus.jhu.edu/us-map
8 | R-bloggers. (2011). Adding lines or points to an existing barplot. R-bloggers. Retrieved from https://www.r-bloggers.com/adding-lines-or-points-to-an-existing-barplot/
9 | Rascia, T. (2016). Adding syntax highlighting to code snippets in a blog or website. Tania Rascia. Retrieved from https://www.taniarascia.com/adding-syntax-highlighting-to-code-snippets/
10 | RStudio Publications. (n.d.). Text mining example codes (tweets). RStudio Publications. Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/66739_c4422a1761bd4ee0b0bb8821d7780e12.html
11 | Sipra, V. (2020). Twitter sentiment analysis and visualization using R. Towards Data Science – Medium. Retrieved from https://towardsdatascience.com/twitter-sentiment-analysis-and-visualization-using-r-22e1f70f6967
12 | Twitter. (2020). Twitter developer docs (Twitter API). Twitter Developer. Retrieved from https://developer.twitter.com/en/docs
13 | University of Virginia. (2019). An introduction to analyzing Twitter data with R. University of Virginia Library – GitHub. Retrieved from https://uvastatlab.github.io/2019/05/03/an-introduction-to-analyzing-twitter-data-with-r/
14 | Van den Rul, C. (2019). A guide to mining and analyzing tweets with R. Towards Data Science – Medium. Retrieved from https://towardsdatascience.com/a-guide-to-mining-and-analysing-tweets-with-r-2f56818fdd16
15 | Van den Rul, C. (2019). How to generate word clouds in R. Towards Data Science – Medium. Retrieved from https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a
16 | Wassner, L., & Farmer, C. (2019). Text mining Twitter data with TidyText in R. Earth Lab. Retrieved from https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/text-mining-twitter-data-intro-r/
17 | Yahoo! Finance. (2020). S&P 500 (^GSPC). Yahoo! Finance. Retrieved from https://finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC