Role of Machine Learning and Sentiment Analysis
Advances in technology and feature-based paradigms have been driven largely by Machine Learning (ML), which is increasingly being incorporated across industries (Pineau 2019). It has paved the way for technical systems and devices that were unimaginable a few years ago. Real-world applications of machine learning that have transformed the fundamentals of many systems include image recognition, sentiment analysis, product recommendation, and fraud and spam detection. Web-based social networks are widely used for information exchange (Jost et al., 2018) among individuals and institutions across the globe. The number of people using social networking applications has increased rapidly, especially in the last decade. Over the preceding year, Facebook, Twitter, YouTube, LinkedIn, and others all saw significant growth. With 2.8 billion monthly active users, Facebook is the largest social networking platform, whereas Twitter has roughly 300 million monthly active users (Sattar and Arifuzzaman 2021). Twitter continues to gain popularity and is growing rapidly. People use the platform to voice opinions, to campaign, to pursue ideological objectives, and to exchange information, so it plays an increasingly important role in societal development. The debate about vaccine development, availability, effectiveness, and adverse effects is ongoing, and it pervades media articles and Twitter feeds daily.
In late 2019, several cases of a pneumonia-like disease were reported in Wuhan, China (He, Deng and Li 2020), and the disease rapidly spread across the globe. Within a few months, it was named Coronavirus Disease 2019 (COVID-19) and was declared a pandemic by the World Health Organization. Once COVID-19 vaccination began to be scaled up, the situation started to stabilize. As more evidence of vaccination’s beneficial effect on the spread emerges, public confidence is expected to grow. In light of this, evaluating public sentiment is critical for encouraging people to get the COVID-19 vaccine. This paper analyzes public feedback and opinion about vaccination through the use of ML algorithms.
Sentiment analysis is used to conduct this research and evaluation. Sentiment analysis is a computational technique that looks for valence in text, ranging from positive to negative. Machine learning systems learn to identify sentiment without human involvement by being trained on samples of text labelled with the feelings they express. Broadly speaking, machine learning helps machines to learn new tasks without being explicitly programmed to do so (Mahesh 2020). Sentiment analysis models are trained to recognize factors such as context, emotion, and inconsistently applied terms in addition to simple meanings. A range of methodologies and complex algorithms are used to train computers to conduct sentiment analysis. Each has advantages and disadvantages, but when used in tandem they can provide very good outcomes. Sentiment analysis is considered a crucial component of computational studies and is currently used to analyze the views of different individuals about a particular entity. Zhang, Wang, and Liu (2018) give a clear definition, stating that sentiment analysis or opinion mining is the computational study of people’s opinions, sentiments, emotions, and attitudes towards entities. The inception and rapid expansion of this field coincide with those of web-based social networking platforms. For instance, sentiment analysis with ML algorithms is typically based on the information available from blogs, posts, tweets, and similar sources. The widespread adoption of these platforms has created an era in which large amounts of opinionated data are generated and recorded digitally.
Since the early 2000s, sentiment analysis has expanded to become one of the most dominant research domains in Natural Language Processing (NLP) (Liu et al., 2019). It is also widely researched in data mining, web mining, text mining, and the extraction of information from large databases. Because of the significant benefits it produces when ML-based algorithms are applied appropriately to the data, sentiment analysis has spread from computer science to management research, financial planning, political science, and communication studies. An assessment of the features of sentiment analysis and a review of the existing literature form a major part of this research paper.
Role of Social Media in Analyzing Public Sentiments
Subsequent sections of this paper take COVID-19 vaccination as the central entity for sentiment analysis. As reported by Melton et al. (2021), social media plays a major role in analyzing the sentiments of different people after they have received their vaccine doses. To obtain appropriate results and knowledge about public opinion, scraping tweets was considered the best approach for meeting the relevant analytical standards. The results achieved with ML algorithms also helped us to understand public reactions, and will subsequently help policymakers to plan further vaccination campaigns and include appropriate health and safety measures. Medical scientists and politicians can better understand the community’s response to vaccination during the COVID-19 epidemic by analyzing Twitter activity. It also clarifies people’s perspectives on the healthcare recommendations for COVID-19 control following vaccination. The findings of research on vaccine-related feedback can be used to build more comprehensive models, which in turn can be used to generate ideas for the wider populace and to develop relevant programs and strategies. Medical academics and researchers can gain a global perspective from social media activity, which is especially useful during a global epidemic. This research can be reproduced by harvesting tweets on a regular schedule until the COVID-19 epidemic is over, to gain a better grasp of public mood during the vaccine program in different areas. We want to get a sense of whether the vaccination effort is going well and whether individuals are aware of the circumstances.
To proceed with the research, it was important to review the existing literature in academic resources, journals, and research papers related to sentiment analysis and to the ML algorithms that can be used to analyze the dataset. Various machine learning algorithms and approaches are available for sentiment analysis. Nevertheless, it is essential to identify the appropriate approach before selecting algorithms. A hybrid algorithmic framework was therefore selected to analyze the dataset, and the review of resources led us to base the paper on the COVAX Tweet sentiment analysis dataset available from Kaggle. For the research study, the Convolutional Neural Network (CNN) and the Artificial Neural Network (ANN) are the chosen frameworks for visualizing the relevant data within the dataset and transforming them into meaningful, informative content, respectively. According to Aldubaidi et al. (2021), the Convolutional Neural Network is among the most popular and widely used deep learning networks. The key benefit of a CNN over its predecessors is that it automatically recognizes important features without any need for human intervention, which makes it the most widely used. Convolutional neural networks have one or more layers that can be pooled or fully connected. They do particularly well in paraphrase identification and semantic parsing, and they are also used in signal processing and image classification. Similarly, following El-Khatib, Abu-Nasser, and Abu-Naser (2019), Artificial Neural Networks were chosen because they are becoming increasingly prominent due to their capacity to generalize and their resilience to noisy and erroneous input.
A great deal of development is under way to increase the efficiency and reliability of neural network modelling and training. The paper also explains how the sentiments were analyzed using machine learning-based lexicon analysis together with several effective approaches. The sentiment survey was carried out using both techniques and a supervised machine learning algorithm, namely either the Naive Bayes algorithm or the Support Vector Machine (SVM) algorithm (Seeja and Suresh 2019), to complete the analysis.
Importance of Analyzing Public Sentiments on COVID-19 Vaccines
The analysis and outcomes of the sentiment analysis with the relevant ML algorithm are based on data extracted from social media accounts, and the dataset collected concerns COVID-19 vaccination. To address the question of why only tweets were chosen and why data from other social networking applications were left out, the research also discusses the results of a performance analysis carried out on a limited source of data. The subsequent part of this paper presents the proposed solution for sentiment analysis of public feedback on vaccination. To predict sentiment using machine learning algorithms, the code for loading, processing, and analyzing the data is written in the R programming language on the RStudio platform.
The primary aim of the research is to analyze tweets made by users about vaccination and to observe the sentiments associated with COVID-19 vaccination using sentiment analysis methods and machine learning algorithms.
Objectives:
The objectives of this research are to:
1) Choose a suitable secondary dataset of vaccination tweets for analysis.
2) Analyze the tweets to extract key insights.
3) Predict the sentiment associated with the tweets using a lexicon-based approach.
4) Fit machine learning models to the predicted sentiment using text-based features extracted from the tweets.
5) Assess the performance of the ML algorithms and decide the best ML method for sentiment prediction.
6) Recommend strategies for using the results of the research and briefly introduce its possible limitations.
Design of proposed solution:
Sentiment analysis and its interpretation are now part of computational studies and help to analyze the feelings of individuals about a specific entity. Here, COVID-19 vaccination is the central entity for sentiment analysis. The research uses attributes that relate to human feelings, namely the attitudes of individuals towards the services and their opinions. Both positive and negative attributes are considered in this sentiment analysis process. The objective of the sentiment analysis is to find out people’s viewpoints through in-depth computational processing. In this study, the feelings of people from different backgrounds have been surveyed to perform the analysis with maximum accuracy. Subjective statements and other vaccine-related text have been used to carry out the machine learning-based sentiment analysis of COVID-19 vaccinations.
Review of Existing Literature
Sentiment Analysis through Social Media Platform (like Twitter):
After the COVID-19 outbreak, sentiment analysis gained priority as an informative research approach in various contexts. In this approach, the factors related to sentiment are captured electronically so that sentiment can be predicted more reliably. People’s views and attitudes may be apparent on the internet, but they are not properly organized or synchronized (Weinzierl and Harabagiu, 2021). To predict sentiment in the healthcare field, information can be collected from healthcare professionals, and data can be obtained from an analysis of patient sentiments. This can help to overcome the communication barrier between healthcare providers and patients.
In this post-pandemic situation, social media plays an innovative role in analyzing sentiments about any area. Weinzierl and Harabagiu (2021) stated that it connects people from different corners of the world and also makes it possible to contact people privately to collect data during the vaccination period. Therefore, people feel free to express their personal viewpoints through vaccination survey polls.
Social media platforms such as Twitter generate statistics that help to identify individual opinions about vaccination (Chaudhri et al. 2021). However, the visibility of these data to other social media users is restricted, as there may be privacy issues. Suppose, for example, that most people from an economically disadvantaged background in the Ghana region are not interested in vaccination.
In that case, only the overall opinion statistics will be publicly visible, and the rest will be available in the researcher’s portal only. Thus, social discrimination and privacy concerns can be avoided. Sentiment analysis is therefore an important tool for following discussions on social networking platforms by monitoring the collected data and analyzing them appropriately (Amjad et al. 2021). It is also useful for building statistics on pro-vaccination and anti-vaccination groups. This debate has to deal with various related factors such as availability, effectiveness, and side effects (Sandaka and Gaekwade, 2021). In practice, it is quite difficult to perform an appropriate analysis along these lines and reach a conclusion regarding COVID-19 vaccinations.
To mitigate these issues, computer-based machine learning tools are required to perform the data analysis reliably. Research based on artificial intelligence can readily make use of COVID-19 vaccination responses. To put the theory into practice, researchers can use Python tools that access the social media API once the data have been posted on Twitter profiles. Moreover, access through the API requires the user’s credentials, which offers a secure way to obtain the data for prediction.
Algorithmic Framework for Analyzing Dataset
Two main types of data about vaccination during the pandemic period have been collected for analysis. In the first round, feedback was collected during the initial lockdown period following the sudden COVID-19 outbreak. People in this period showed neutral feelings because vaccines were not yet available, so sentiment was hard to predict, as most responses offered little to analyze. Using various machine learning algorithms, digital tracking methods have been introduced in which general public opinion is taken into account for further analysis (Wilson and Wiysonge, 2020). To gain proper knowledge, several machine learning algorithms need to be applied, which helps to deepen the understanding of sentiment analysis.
Useful ML algorithms and their Impacts on Sentiment Analysis:
Many machine learning algorithms are available for sentiment analysis. However, the best approach needs to be identified before choosing the algorithms that will be used to analyze responses to the COVID-19 vaccinations.
Lexicon based ML Approach:
In the lexicon-based machine learning approach, manually crafted rules are used to classify data and determine the sentiment of individuals. This approach relies on polarity-based analysis, in which the combination of positive and negative responses is evaluated together with their strength across the dataset (Araque et al. 2020). This produces a sentiment score. Rule-based sentiment analysis algorithms can be used here and customized further. This approach has therefore been used to evaluate the dataset based on the feedback collected from Twitter profiles.
Lexicon analysis counts the total number of positive and negative responses about the vaccinations. If the number of positive responses is greater than the number of negative responses in the dataset, the analysis reports positive sentiment as the final outcome (Araque et al. 2020). It also identifies neutral statements by using response tracking tools.
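The counting rule described above can be illustrated with a minimal Python sketch. The two word lists are hypothetical toy lexicons chosen for illustration only, and the function name `lexicon_sentiment` is an assumption; a real analysis would rely on a published lexicon, such as those provided by the R packages used in this study.

```python
import re

# Hypothetical toy lexicons (assumption, not taken from the study).
POSITIVE = {"safe", "effective", "grateful", "good", "protected"}
NEGATIVE = {"unsafe", "scared", "bad", "painful", "worried"}

def lexicon_sentiment(text):
    """Label a text by comparing counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # equal counts, including no lexicon words at all
```

A text with no lexicon words, or with as many positive as negative words, is labelled neutral under this rule.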
Automated Machine Learning Approach:
The automated machine learning approach is useful for identifying the central theme of a statement. Under this approach, the sentiment analysis model tries to improve on existing models by including many new criteria. The overall analysis therefore becomes easier, and the approach is used to process the collected dataset and then apply effective supervised machine learning algorithms to it (Prakash et al. 2020).
Convolutional Neural Network
In this report, the advanced automated machine learning approach involves a training dataset with multiple samples extracted from Twitter responses. The algorithms are run on those samples repeatedly until the sentiment is predicted with a high level of accuracy. CNN and RNN are the main models used by this approach to predict the data efficiently.
Convolutional neural networks (CNN):
This model is well known for extracting features from image content; in this context, the same mechanism can be used to extract the opinions shared in social media posts. The input data are analyzed, the training dataset is built, and finally the important features associated with sentiment are evaluated by the algorithm.
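For text input, a convolutional layer is usually applied along the token sequence rather than over image pixels. The following Python sketch shows one such filter followed by ReLU and max-over-time pooling. It is an illustration under assumptions, not the network used in this study: the function name `conv1d_max_pool` and the toy embeddings are hypothetical, and a trained CNN would learn many filters rather than use a fixed one.

```python
def conv1d_max_pool(embeddings, kernel):
    """Slide one filter over a sequence of token vectors and max-pool the responses.

    embeddings: list of token vectors (one list of floats per token)
    kernel:     list of weight vectors spanning len(kernel) consecutive tokens
    """
    width = len(kernel)
    responses = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product between the window and the filter weights
        score = sum(w * k
                    for vec, kvec in zip(window, kernel)
                    for w, k in zip(vec, kvec))
        responses.append(max(score, 0.0))  # ReLU activation
    # Max-over-time pooling keeps the strongest response in the sequence
    return max(responses) if responses else 0.0
```

The pooled value is a single feature per filter, which is why the output does not depend on the length of the tweet.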
Recurrent Neural Networks (RNN):
This is another algorithm, used to build a model on sequential data. The derived data are fed to the network step by step, and the predictions are then checked for accuracy (Carrieri et al. 2021). Because the data are processed sequentially, the model is well suited to sentiment analysis of the surveyed text.
Deep learning and Naïve Bayes Algorithms:
Deep learning algorithms are a subset of machine learning algorithms that process data within AI networks (Memon et al. 2020). They are useful for resolving the complicated parts of the survey results. In this section, the analysis can be optimized through a deep learning method, while the Naïve Bayes algorithm is used to make a probabilistic assumption about the data. Naïve Bayes applies the concept of conditional probability, for example to “the effect of vaccinations on health”. Here, vaccination is the independent variable and health is the dependent variable, which needs to be evaluated from both the positive and the negative side.
This algorithm computes the probability of each value separately and can take more time to implement. According to Mahesh (2020), however, it can provide more accurate results than the CNN and RNN algorithms. A clear rationale has been used to predict the vaccination responses: neutral responses are further analyzed through this predictive model, and no incomplete results are generated by this approach.
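As an illustration of the conditional-probability idea, the sketch below implements a small multinomial Naïve Bayes classifier in Python. Add-one (Laplace) smoothing, the function names, and the tokenized input format are assumptions made for this example and are not taken from the study.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (tokens, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(tokens, model):
    """Return the label with the highest log posterior, treating words as independent."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # log P(word | label) with add-one smoothing (assumption)
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Each word contributes its own conditional probability, which is the sense in which the algorithm handles each value separately.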
Random Forest Algorithm:
If the data are sampled randomly, the Random Forest algorithm can be used to perform the analysis by combining the outputs of different decision trees. Precision, recall, and F1-score are calculated separately to evaluate its performance on the collected data. If the data are unorganized, an NLP processing model should be applied first, and the analysis is then performed on the processed data.
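The three measures mentioned above can be computed from the true and predicted labels as in the following Python sketch. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive="positive"):
    """Compute precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For a three-class problem (positive, negative, neutral), the function can be called once per class and the results averaged.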
Artificial Neural Network
Hybrid Approach:
In the hybrid approach, the sentiment analysis model uses a framework in which several machine learning algorithms are available. A well-designed hybrid framework can therefore combine a couple of algorithms while remaining flexible to customize.
Run the ML algorithms and perform Sentiment Analysis:
Once the machine learning algorithms are selected, the next task is to build a hybrid platform to handle the analysis (Du et al. 2020). All the segments need to be processed by the various machine learning algorithms to predict the sentiments of individuals. Once the survey data have been collected from different online platforms, such as social media accounts, Twitter posts, online survey polls, and recently published online journals, the data should be sorted according to feedback category (Tomaszewski et al. 2021). The following steps help to complete this task without technical problems, provided that users follow them in sequence to meet the project goal effectively.
First, start the opinion polls on digital platforms and through social media accounts. Users are then invited to participate in this “vaccination sentiment analysis” campaign. They can share their thoughts and opinions freely, and the researcher can keep all of those data hidden from other users (Parmar et al. 2021). A password policy can be applied to protect the data from public visibility.
Secondly, gather all the emojis and categorize them according to the opinion they express. Then, run the machine learning algorithms and check the data against the trained model to perform sentiment analysis. Make sure that both the test and the training datasets are correct and contain no arbitrary data.
Thirdly, choose the classifiers so that the data can be classified as the analysis proceeds. The classifier predicts the label according to the probability of negative and positive feedback.
Finally, the training data are verified by applying the appropriate machine learning algorithms, which are first executed on a small section of the dataset. Once the training dataset is verified and no error is found, the large dataset is used to make the prediction more accurate. Batch analysis has been chosen to predict the vaccination opinions in a proper sequence: batch processing runs all the activities and stores the results in separate columns once the analysis is done (Du et al. 2019). This process is repeated by calling a subroutine within the batch processing program. At last, the file with the completed data analysis is downloaded.
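The batch step described above can be sketched in Python as follows. The helper names `batches` and `analyse_in_batches` are hypothetical, the batch size is an arbitrary choice, and `classify` stands for whichever trained classifier is applied to each tweet.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def analyse_in_batches(tweets, classify, size=2):
    """Apply `classify` batch by batch and collect (tweet, label) pairs in order."""
    results = []
    for batch in batches(tweets, size):
        # The same subroutine is called for every batch
        results.extend((tweet, classify(tweet)) for tweet in batch)
    return results
```

Running the routine first on a small slice of the data, and only then on the full dataset, mirrors the verification order described in the text.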
Data Visualization and Analysis
Discussions and Comparisons on Proposed Algorithms:
To perform these activities, a hybrid algorithmic framework can be selected in which multiple ML algorithms and classifiers are taken into account. The Convolutional Neural Network (CNN) and the Artificial Neural Network (ANN) have been chosen to visualize the relevant data and to transform them into meaningful, informative content, respectively. Once the data are organized in a proper sequence, sentiment analysis is applied. To perform it appropriately, machine learning-based lexicon analysis has been proposed using current effective approaches (Mishra et al. 2021). Using both techniques, the sentiment survey has been conducted through a supervised machine learning procedure, which can be carried out with either the Naïve Bayes algorithm or the Support Vector Machine (SVM) algorithm.
Those algorithms and classifiers are the fundamental requirements for evaluating the social media data (Twitter responses) about the COVID-19 vaccine. The sentiment analysis results are then classified into several sub-divisions.
The overall outcome of sentiment analysis cannot be obtained using those basic learning techniques alone. According to Joshi et al. (2018), those techniques play a significant role in investigating huge datasets and performing the primary analysis. CNN segmentation and Natural Language Processing (NLP) are then used to arrange the social media results into high-level patterns. Moreover, automatic text synthesis is used to produce records for every category of response.
After that, pipelining methods are used to analyze the report, and once the major part is done, qualitative assessments are conducted. The aim of this section is to identify public reactions during the vaccination period through real-time analysis. Sentiment analysis of tweets can be conducted only if NLP and the other proposed machine learning models are applied successfully (Gonzalez-Dias et al. 2020). Here, data are gathered from Twitter posts and from the responses to them. The sentiment lexicons are formatted using NLP-based Python tools.
Cotfas et al. (2021) stated that once data collection is over, various distinct classifiers are proposed to classify the organized data. Classifiers such as Linear SVC with unigram features are considered to obtain the maximum accuracy in the final outcome. To improve the accuracy of the classification, the collected tweets are processed by a combination of two separate algorithms, namely decision tree algorithms and the Support Vector Machine (SVM) algorithm. This integrated framework provides better results in terms of F1-score. The overall accuracy can thus be maximized, and the analysis can also be carried out with another model, Naïve Bayes, in a different format. Moreover, k-nearest neighbour methods are used to produce the data together with relevant graphical analysis and classifiers. Once this framework has been used for a basic data analysis, more advanced classification methods can be used to perform the rest of the operations (Ong et al. 2020).
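Unigram features of the kind mentioned above can be built as simple word-count vectors before being passed to a classifier such as Linear SVC. The Python sketch below is a minimal illustration; whitespace tokenization and the function name `unigram_features` are assumptions, and the classifier itself is not shown.

```python
from collections import Counter

def unigram_features(tweets):
    """Return a sorted vocabulary and one word-count vector per tweet."""
    tokenized = [t.lower().split() for t in tweets]
    vocab = sorted({tok for tokens in tokenized for tok in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word, in vocabulary order
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors
```

Each column of the resulting matrix corresponds to one word, so the classifier learns one weight per unigram.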
Applications of Sentiment Analysis
To check the polarity of the Twitter responses, different deep learning models have been proposed according to their analytical features. LSTM and RNN models are used to perform sentiment analysis on the topic under discussion. These classification methods play a significant role in analyzing negative responses from users (Smith et al. 2019). The data are further categorized into distinct sections to minimize errors, and the same process is used for positive responses. In both cases, sub-categories such as “Strongly Agree”, “Strongly Disagree”, “Agree”, and “Disagree” are available for the Twitter survey. This makes it possible to predict the outcome of vaccination opinion polls as accurately as possible.
On the other hand, the survey also covers people who are either not interested in sharing their viewpoints or have no knowledge of the vaccination. As per Riyanto and Azis (2021), the survey data are collected using the previous approaches and categorized according to the questionnaires. Suppose a person answers several survey questionnaires, choosing the option “Strongly Agree” for only two questions and replying “No Comments” to the rest. In that case, those responses are still reflected in the overall survey results.
This sentiment analysis can be performed with Random Forest classification, evaluated by F-score, recall, and precision on the collected tweet responses. In this section, the Random Forest algorithm is considered the best option for the final data analysis, as it takes several relevant factors into account. The sentiment analysis is based on tweet responses and notification feeds. In the final analysis, the sentiment of a participant towards the vaccinations is either negative or positive.
Impacts of ML in Sentiment Analysis for Vaccination:
The aim of machine learning in this project is to determine public opinion and to identify the crisis through a digital tracking system. According to Edo-Osagie et al. (2020), this research raises important questions for digital disease surveillance. Moreover, a new social media-based sentiment analysis methodology has been proposed that uses machine learning algorithms with an emphasis on emojis.
Those algorithms follow two-way embedding policies that can analyze both negative and positive traits on digital platforms. Once all the data have been collected from the different digital profiles, they are checked for relevance, and the sentiments are then classified using symbolic codes. To do this, a long short-term memory (LSTM) network has been applied to the data collected from the internet. The researcher can use two-way integration to recognize emojis effectively. Embedding-based, state-of-the-art machine learning plays a significant role in performing this task.
The feelings are categorized using standard artificial intelligence-based machine learning algorithms, which work well for pre-processing text. In this context, it is better to apply a hybrid technique to analyze sentiments through machine learning. According to Hussain et al. (2021), the researcher can select the best ML techniques and then apply them to the sentiment lexicons. The algorithms must be powerful enough to handle the vaccination surveys and must be effective in sentiment analysis without technical glitches. To meet all those requirements, the following machine learning algorithms can be used.
Research Gap Analysis:
In this research, the data are taken only from social media accounts and profiles in order to collect a large dataset concerning COVID-19 vaccination. A novel strategy has been introduced to collect the relevant data from the Twitter platform and to process those data through neural network analysis. Because only tweets and responses to them are taken into account, other social media platforms are ignored. The performance analysis has therefore been done on a limited source of data. In future, online surveys should be carried out on various social media platforms so that the data can be analyzed more thoroughly.
The limited timeframe was another barrier to collecting the dataset while the research was conducted and the analysis was done. It was necessary to perform data analysis and verification using special data analysis tools and to investigate whether the training and test data in every dataset were relevant to each other.
Proposed methodology based on design:
In this section the entire design of the proposed solution for sentiment analysis of vaccination feedback, and for predicting the sentiment with machine learning algorithms, is described in detail. The vaccination tweets posted by people around the world are obtained from the Kaggle repository (CoVax Tweet sentiment analysis, 2022). All coding for loading, pre-processing and analysing the data is performed in the R programming language on the RStudio platform. Pre-processing and NLP libraries such as “tm”, “SnowballC”, “wordcloud”, “RColorBrewer”, “syuzhet” and “ggplot2” are used for purposes such as data manipulation, sentiment extraction and visualization. The variable of primary interest in the dataset is the tweet text posted on Twitter about vaccination. Hence, it is extracted first and used to label the sentiment of each tweet as positive, negative or neutral with a lexicon-based approach. Because sentiment extraction depends on the words matched by the lexicon method, several stages of pre-processing are performed on the tweets to filter out unnecessary features. The most common step is to remove special characters such as ‘/’, ‘@’ and ‘:’, which carry no sentiment because people do not use these symbols to express emotion in a statement. All letters in the tweets are then converted to lowercase so that the same word written in different cases is treated as one word by the lexicon method. Next, it is beneficial to remove the common stop words from the texts; in English these are words such as ‘a’, ‘the’, ‘are’ and ‘was’, that is, mostly articles, auxiliary verbs and prepositions. Custom stop words that are expected to have no effect on the sentiment of the text are also frequently removed before the lexicon method is applied.
In the chosen dataset the vaccination tweets contain tokens such as ‘http’, which come from website links used for reference; these can be considered stop words and are therefore removed. Punctuation removal is also an important step in pre-processing the vaccination tweets. In the final cleaning step, extra white space in the comments, typically leading and trailing spaces, is removed as it has no relation to the content of the tweets. Text stemming is then performed, which reduces the words to their root form.
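The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the stop-word list is a hypothetical stand-in for a real one, whole links are dropped in a single step (an assumption about how ‘http’ tokens are handled), and stemming is left out because it needs a dedicated stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are far longer.
STOP_WORDS = {"a", "the", "are", "was", "is", "of", "to", "in", "http", "https"}

def clean_tweet(text):
    """Return the cleaned tokens of one tweet."""
    text = text.lower()                        # case folding
    text = re.sub(r"https?://\S+", " ", text)  # drop links (assumed handling)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop '/', '@', ':' and punctuation
    tokens = text.split()                      # also removes extra white space
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tweet("  The Vaccine was GREAT: see https://t.co/abc @user  "))
```

The order matters: links are removed before punctuation, otherwise a link would be split into fragments that survive as separate tokens.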
A term-document matrix is then constructed from the pre-processed tweet texts; it contains the words that occur in the texts together with their frequencies. The frequencies can be visualized with a word cloud, in which words are plotted with different colours and font sizes, and a bigger font size generally indicates a higher frequency. Because the term-document matrix usually contains a large number of words, only the top 50 or 100 words are shown so that the words fit in the limited space of the cloud without overlapping and remain easy to read (Anandarajan, Hill and Nolan 2019). Thus, the word with the biggest font in the cloud has the highest frequency in the term-document matrix.
The association between words is also important to investigate, as it shows which words are well correlated with the top words in the term-document matrix. This is achieved by calculating the correlation between each top word and the remaining words from their co-occurrence in the texts; the words whose correlation coefficient exceeds a chosen threshold (typically 0.25) are reported in the output.
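As a rough illustration of how the word frequencies behind such a word cloud can be obtained, the following Python sketch (not the R code of the study) counts terms over already tokenized tweets and keeps the most frequent ones. The function name and the toy tweets are hypothetical.

```python
from collections import Counter

def top_terms(token_lists, n=10):
    """Count every term over all tweets and return the n most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

# Toy, already pre-processed tweets (hypothetical data).
tweets = [["vaccin", "covid", "dose"], ["vaccin", "sputnik"], ["covid", "vaccin"]]
print(top_terms(tweets, n=2))
```

The returned pairs of word and frequency are exactly what a word cloud needs: the frequency decides the font size of each word.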
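One common way to compute such associations is the Pearson correlation between the per-document counts of two terms. The Python sketch below (an illustration, not the R routine used in the study) assumes this definition; the helper names and the toy tweets are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no association
    return cov / sqrt(vx * vy)

def associations(docs, term, threshold=0.25):
    """Terms whose per-document counts correlate with `term` at or above the threshold."""
    vocab = sorted({t for d in docs for t in d})
    target = [d.count(term) for d in docs]
    found = {}
    for other in vocab:
        if other == term:
            continue
        r = pearson(target, [d.count(other) for d in docs])
        if r >= threshold:
            found[other] = round(r, 2)
    return found

docs = [["vaccin", "dose"], ["vaccin", "dose"], ["covid"], ["covid", "case"]]
print(associations(docs, "vaccin"))
```

In this toy corpus ‘dose’ always appears together with ‘vaccin’, so it is the only term reported; terms that never co-occur get a negative correlation and are filtered out.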
Lexicon based sentiment extraction method:
Finally, the sentiment of the texts can be extracted by applying the get_sentiment() function of the syuzhet package with an appropriate method. Four methods are available in the package: syuzhet, bing, afinn and nrc. The nrc method extracts more than a numeric sentiment score from the text and requires additional interpretation, so it is not relevant to this project and is out of scope. In the syuzhet method the sentiment scores are output on a continuous scale from -1 to +1, where -1 indicates the most negative sentiment and +1 the most positive sentiment (Kiruthika et al. 2021). In the bing method the sentiments are output as +1 or -1, and neutral sentiment is represented by zero. In the afinn method the sentiments are given on an integer scale from -5 to +5. Because the sentiments obtained by the lexicon-based approach are later to be classified with machine learning models, the bing method is used here to obtain a score of +1 or -1 for positive and negative sentiment and 0 for neutral or no sentiment in the tweet. The bing lexicon was developed by Minqing Hu and Bing Liu and described in a 2004 paper whose objective was to mine opinions from customer reviews. In that work a total of 6786 words are classified as either positive or negative (4781 negative and the remaining 2005 positive). Thus, the approach provides a dictionary of words labelled with positive or negative sentiment, and the instances in the text dataset are labelled according to it. When none of the words of a text is present in the dictionary, the text is assigned neutral or no sentiment. The formula for calculating the sentiment score by the bing method is
Bing sentiment Score = (number of positive words – number of negative words) / total number of words
Thus, the score is guaranteed to lie between -1 and 1. A value above zero is considered positive sentiment and labelled +1, and a value below zero is considered negative sentiment and labelled -1. A score of exactly zero occurs when the number of positive words equals the number of negative words in the text (including the case where there are none), and the text is then considered neutral or without sentiment. A weakness of the bing method is that it cannot detect the correct sentiment of twisted comments: words such as ‘worth’, ‘gain’ and ‘advantage’ are positive but can be used with negative intent, and in that case the sentiment predicted by the bing method is incorrect. There also exist words, such as ‘significant’, that are used equally to express positive and negative sentiment, and the bing method is unable to extract the deeper meaning of such texts.
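The score formula and the labelling rule can be illustrated with a short Python sketch (not the syuzhet implementation). The two word sets are a hypothetical mini-lexicon standing in for the full bing dictionary.

```python
# Hypothetical mini-lexicon; the real bing dictionary has thousands of entries.
POSITIVE = {"good", "safe", "effective"}
NEGATIVE = {"bad", "pain", "risk"}

def bing_score(tokens):
    """(positive words - negative words) / total words, as in the formula above."""
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def label(score):
    """Map a score to +1, 0 or -1 according to its sign."""
    return (score > 0) - (score < 0)

s = bing_score(["vaccine", "safe", "effective", "pain"])
print(s, label(s))  # 0.25 1
```

Two positive words and one negative word in a four-word text give (2 - 1) / 4 = 0.25, which is labelled +1. A text with no lexicon words, or with equal numbers of each, scores 0.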
The next part of the methodology is to use machine learning models to predict the sentiments obtained for the texts from their textual features. The sentiment could also be modelled with non-textual features, but these are likely to be only weakly correlated with the sentiment of the tweet and are therefore not used. Non-textual features are features that may be related to the associated sentiment but are not directly part of the comment made by the user on Twitter. In the dataset these are the location of the user, the date on which the user created the Twitter account, the number of followers, the number of friends and the number of favourites of the user, the verification status of the user on Twitter, the date on which the comment was made, and the hashtags associated with the text (Mitra 2020). Some of these features are numeric and some are categorical, so pre-processing is needed before machine learning models can be trained on the features and the known sentiment labels obtained by the lexicon method. Numeric features are usually standardized or normalized to bring them all to the same scale, while categorical features are either label encoded, with a unique integer for each category, or one-hot encoded, which creates as many binary columns as the feature has categories. One-hot encoding is more appropriate here because the categorical features have no natural ordering, so every category of a feature must be given the same weight. One-hot encoding necessarily increases the number of feature dimensions, so the dimensionality then has to be reduced with a univariate, model-based or sequential feature selection method in order to fit the machine learning models in feasible time.
Finally, machine learning models such as random forest, naïve Bayes and the linear discriminant classifier can be fitted on a large portion (the train set) of the pre-processed features and the corresponding sentiment labels. The models can then be evaluated on the remaining portion of the data (the test set) and fine-tuned to obtain the best results in terms of relevant metrics such as accuracy, sensitivity, specificity and area under the ROC curve. The model with the best performance on the test set can then be recommended for sentiment prediction from the stated textual features.
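The two encodings mentioned above can be sketched in a few lines of Python (an illustration only, since non-textual features are not used in this study). The column names and values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Return the sorted categories and one binary column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

followers = [10, 20, 30]        # hypothetical numeric feature
verified = ["no", "yes", "no"]  # hypothetical categorical feature
print(standardize(followers))
print(one_hot(verified))
```

A feature with k categories becomes k binary columns, which is why one-hot encoding a field such as user location can inflate the feature space and make a later selection step necessary.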
Applied methodology for the research
The dataset mentioned earlier contains a total of 69718 vaccination-related tweets with 16 attributes: the unique id of the tweet, the name of the user who made the tweet, the location from which the tweet was made, the description of the Twitter user, the time when the user created the account, the number of followers, the number of friends and the number of favourites of the user, the verification status of the user profile, the date on which the tweet was made, the tweet text, the hashtags used in the tweet, the source device from which the tweet was made, the number of retweets, the number of favourites and the retweet status of the tweet. Among these variables the only column of interest is the tweet text, as it carries the sentiment of the tweet. The approach applied to the dataset is summarized below. Throughout the method several R libraries (“tm”, “SnowballC”, “wordcloud”, “RColorBrewer”, “syuzhet”, “ggplot2”, “superml”, “tidyverse”, “MASS”, “pROC”, “randomForest”, “naivebayes”) are used for purposes such as creating the term-document matrix, pre-processing the text data, visualization, sentiment extraction, and fitting and evaluating the ML models. After the dataset is loaded, the first pre-processing step is to replace non-English characters in the tweets with the corresponding Latin characters by using the gsub() function in combination with the iconv() function. The vector of tweet comments is then converted to a corpus for term-document analysis. In the term-document analysis, special characters that have no influence on the sentiment of the text, such as ‘/’, ‘@’ and ‘:’, are removed and all words are converted to lowercase. Finally, leading and trailing white space in the texts is stripped and the term-document matrix is created.
The term document matrix is then sorted in descending order based on the frequency of words and the top 10 most frequent words are displayed as given below.
Bar plot of top 10 most frequent words:
It can be observed that the top words, each used in over 800 comments, are https and tco, indicating that a significant number of people referenced a website in their comments while making a vaccination-related tweet.
Now, the most popular words in all the tweets can be better visualized with a word cloud which is shown below.
Word cloud with top 100 words:
The word cloud above, in which words are coloured and sized according to their frequencies, shows that apart from https and tco some of the most popular words are sputnik, vaccin, covaxin, covid, Pfizer BioNTech, Sinovac and Sinopharm (Kabir et al. 2018).
Now, the sentiment attached to each tweet is estimated with the get_sentiment() function of the syuzhet library using the bing method described briefly earlier. This results in integer scores for the tweets, mostly -1, 0 and 1; however, some tweets, treated here as outliers, receive more extreme scores above 1 or below -1. These extreme scores are replaced with 1 for positive scores and -1 for negative scores, on the assumption that an extreme negative score indicates negative sentiment and an extreme positive score indicates positive sentiment.
Sentiment distribution as obtained by lexicon-based approach with bing method:
bing_vector
-1 0 1
13269 36928 19521
As seen above, most of the tweets are scored 0, or no sentiment; there is also a substantial number of positive tweets, and negative tweets are the least frequent. Hence, it can be stated that most people in the sample hold a neutral view of vaccination, followed by those with a positive view.
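The clipping of extreme scores and the resulting class shares can be illustrated as follows. This is a Python sketch rather than the R code of the study; the raw scores are hypothetical, while the three counts are those reported in the table above.

```python
from collections import Counter

def clip_to_sign(score):
    """Replace scores beyond -1 or 1 by -1 or 1 and keep the rest unchanged."""
    return max(-1, min(1, score))

raw_scores = [-3, -1, 0, 0, 1, 2]  # hypothetical integer lexicon scores
clipped = [clip_to_sign(s) for s in raw_scores]
print(sorted(Counter(clipped).items()))  # [(-1, 2), (0, 2), (1, 2)]

# Class shares computed from the counts reported in the sentiment table.
reported = {-1: 13269, 0: 36928, 1: 19521}
total = sum(reported.values())
shares = {k: v / total for k, v in reported.items()}
print(total, {k: round(v, 4) for k, v in shares.items()})
```

The reported counts add up to the 69718 tweets of the dataset, and the neutral class alone makes up about 53% of them, a class imbalance that matters for the classifiers fitted later.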
Now, the machine learning approach is employed to model the obtained sentiments from the textual features of the vaccination tweets. This requires pre-processing that converts the text to numeric features, which is done with the TF-IDF vectorizer from the “superml” library. The number of features to be extracted is limited to 10, common English stop words are removed from the texts, and normalization is not performed, in order to reduce the computational cost of the text-to-numeric conversion. The resulting matrix of 10 numeric columns is converted to a DataFrame, to which the class column, the sentiment obtained earlier by the bing method, is appended. The DataFrame is then ready for fitting machine learning models, and three different models, namely naïve Bayes, random forest and the linear discriminant classifier, are chosen.
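To make the text-to-numeric conversion concrete, the sketch below computes a textbook TF-IDF matrix in Python, restricted to the most frequent terms. It is an assumption-laden illustration: the weighting used here (raw count times log(N / document frequency), no normalization) is the classic form and may differ in detail from what the superml vectorizer computes, and the function name and toy tweets are hypothetical.

```python
from collections import Counter
from math import log

def tfidf_matrix(docs, max_features=10):
    """Return the kept vocabulary and one TF-IDF row per document."""
    totals = Counter(t for d in docs for t in d)
    # Keep the most frequent terms; ties are broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [t for t, _ in ranked[:max_features]]
    n_docs = len(docs)
    doc_freq = {t: sum(t in d for d in docs) for t in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[t] * log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows

docs = [["vaccine", "covid"], ["vaccine", "moderna"], ["vaccine", "covid", "covid"]]
vocab, rows = tfidf_matrix(docs, max_features=2)
print(vocab)
print(rows)
```

With this weighting a term that occurs in every document, such as ‘vaccine’ in the toy corpus, receives a weight of zero everywhere. Keeping only a handful of very frequent terms can therefore leave little information for a classifier to separate the classes.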
The advantages of the LDA classifier are that it is very fast, portable and simple to implement, and that it has lower time complexity than some algorithms such as logistic regression. The algorithm also works well with fully numeric data, and since the extracted features are all numeric, LDA is well suited here. LDA is also known as normal discriminant analysis because it assumes that the features are normally distributed. The algorithm is a generalization of Fisher’s discriminant analysis, which is used to find a linear combination of features that can separate two or more classes in the dataset (Kolukisa et al. 2018). The LDA results can also be used for dimensionality reduction; in this work, however, LDA is used for classification because the number of feature dimensions is not large. LDA is closely related to analysis of variance (ANOVA) and linear regression, which also attempt to express a dependent variable as a linear combination of independent variables. The difference is that in ANOVA categorical independent variables are used to explain a continuous dependent variable, whereas in LDA continuous independent variables are used to separate a categorical dependent variable, also known as the class.
The other chosen algorithm is random forest, which is known to work well with unscaled data and to be fairly robust to missing values. Random forest is an ensemble extension of the decision tree algorithm: many decision trees, the base estimators of the algorithm, are grown on random subsets of the data and features, and each tree casts a vote for every instance. The class with the highest number of votes among all trees is taken as the prediction for that instance. Random forest can handle large datasets with many feature dimensions, which makes it suitable for the sufficiently large dataset of this project (Yuchi et al. 2019). Another advantage is that it provides estimates of the importance of the different features, and many researchers have found random forests to produce the highest accuracy among the classifiers they compared. Random forest can also deal with imbalanced data by assigning different weights to different class labels. A further advantage is that the algorithm processes the variables quickly, which makes it suitable for complicated tasks.
The last machine learning classifier used is naïve Bayes, which classifies instances with Bayes’ theorem using conditional probabilities. The algorithm is among the most basic and simplest to implement, yet it provides good accuracy on many datasets. Naïve Bayes has been used by researchers for purposes such as facial recognition, weather prediction, medical diagnosis, news classification and sentiment prediction because of its several advantages. Apart from its simplicity, it does not require much data for training, it can handle both continuous and categorical data, and it is highly scalable, that is, the number of predictors and instances can be increased greatly without adding much to the computation time. As a result, the naïve Bayes classifier can be employed for real-time predictions (Rahat, Kahir and Masum 2019). The algorithm is not sensitive to irrelevant features, so less pre-processing is required when fitting it. On text data naïve Bayes often outperforms other ML classifiers, so a good performance is expected for vaccination tweet sentiment prediction, where the features are textual.
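The voting step of a random forest can be shown with a very small Python sketch (an illustration only; growing the trees themselves is not shown). The list of votes is hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes of five trees for one tweet.
votes = [0, 0, 1, 0, -1]
print(majority_vote(votes))  # 0
```

The share of votes for each class can also be read as a class probability, which is the kind of score a ROC curve for a forest is drawn from. In this sketch a tie is resolved in favour of the class that was encountered first.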
Now, the DataFrame obtained earlier is split into a train and a test set in a 70:30 ratio with a fixed random seed to produce consistent results. This is done by randomly sampling about 70% of the instances from the whole DataFrame with a logical index; the remaining 30% of the instances, selected by reversing the index, form the test set. Before splitting, it is also observed that the columns for the words s, covaxin and dose contain mostly 0 values, so they are considered irrelevant features and removed from the DataFrame. The three machine learning models are then fitted one by one on the train set, each with a formula that predicts the sentiment from the remaining features of the DataFrame. After the models are fitted, predictions are made on the test set to obtain the predicted class label of each instance and its probabilities for the three classes. The predictions are evaluated with several metrics, such as accuracy, sensitivity and specificity, by comparing them with the known labels of the test set. The ROC curve of the positive sentiment class is also drawn from its probabilities and class labels, and the area under the curve is reported for each classifier.
In this section the fitting summaries of the three classifiers, LDA, random forest and naïve Bayes, are presented together with their evaluation results on the test set. The results for LDA and naïve Bayes are reproducible; for random forest slight variations can occur because of the randomness in tree construction, but they will not differ significantly from what is reported here.
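The seeded split by logical indexing can be sketched in Python as follows (an illustration, not the R code of the study). The seed value 42 and the function name are hypothetical.

```python
import random

def train_test_split(rows, train_share=0.7, seed=42):
    """Split rows with a reproducible logical index (seed 42 is an arbitrary choice)."""
    rng = random.Random(seed)
    in_train = [rng.random() < train_share for _ in rows]
    train = [r for r, keep in zip(rows, in_train) if keep]
    test = [r for r, keep in zip(rows, in_train) if not keep]  # reversed index
    return train, test

train, test = train_test_split(list(range(1000)))
print(len(train), len(test))
```

Because each row is assigned independently, the train set holds about 70% of the rows rather than exactly 70%. The fixed seed makes the assignment identical on every run.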
Fitting results of LDA model:
Call:
lda(sentiment ~ ., data = train)
Prior probabilities of groups:
-1 0 1
0.1896848 0.5295685 0.2807467
Group means:
t co vaccine moderna covid covid19 sputnikv
-1 1.112331 1.042440 0.6981656 0.5603760 0.3275249 0.3820099 0.3140138
0 1.086061 1.037158 0.7604549 0.6669491 0.4254924 0.3983632 0.3638999
1 1.090670 1.032344 0.7366728 0.6259903 0.3861865 0.3052385 0.2953059
Coefficients of linear discriminants:
LD1 LD2
t -1.16736266 -0.8587010
co 1.24978539 0.5623224
vaccine 0.09656929 0.2227691
moderna 0.50917979 0.2277933
covid 0.41419741 0.2266524
covid19 0.35705904 -0.7825165
sputnikv 0.53931162 -0.2561813
Proportion of trace:
LD1 LD2
0.7556 0.2444
While fitting, the LDA classifier estimates the prior probabilities of the class labels, and class 0 is found to have the highest prior probability, over 50%. The means of the different features do not vary much between the classes (Mahdianpari et al. 2018). The classes are separated by two linear discriminants, LD1 and LD2, whose coefficients vary considerably across the features; LD1 accounts for 75.56% of the trace and LD2 for the remaining 24.44%.
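How little the classes differ along the first discriminant can be checked by projecting the printed group means onto the LD1 coefficients. The Python sketch below uses only numbers from the fitting output above. It assumes the usual MASS scaling, in which the discriminants are normalized so that the within-group spread is about 1, and it ignores the centring constant, which cancels in differences.

```python
FEATURES = ["t", "co", "vaccine", "moderna", "covid", "covid19", "sputnikv"]

# LD1 coefficients and group means copied from the LDA fitting output.
LD1 = [-1.16736266, 1.24978539, 0.09656929, 0.50917979,
       0.41419741, 0.35705904, 0.53931162]
GROUP_MEANS = {
    -1: [1.112331, 1.042440, 0.6981656, 0.5603760, 0.3275249, 0.3820099, 0.3140138],
    0:  [1.086061, 1.037158, 0.7604549, 0.6669491, 0.4254924, 0.3983632, 0.3638999],
    1:  [1.090670, 1.032344, 0.7366728, 0.6259903, 0.3861865, 0.3052385, 0.2953059],
}

def project(values, coefficients):
    """Linear combination of feature values with discriminant coefficients."""
    return sum(v * c for v, c in zip(values, coefficients))

scores = {label: project(m, LD1) for label, m in GROUP_MEANS.items()}
spread = max(scores.values()) - min(scores.values())
print({k: round(v, 3) for k, v in scores.items()}, round(spread, 3))
```

The projected class means differ by less than 0.2 units on LD1. Under the scaling assumption this is a small fraction of the within-class spread, which fits the observation that the group means barely vary between the classes.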
Evaluation results of LDA model in test set:
Confusion Matrix and Statistics:
Reference
Prediction -1 0 1
-1 0 0 1
0 4012 11084 5819
1 0 0 0
Overall Statistics
Accuracy : 0.5299
95% CI : (0.5231, 0.5367)
No Information Rate : 0.5299
P-Value [Acc > NIR] : 0.5028
Kappa : 0
Mcnemar’s Test P-Value : <2e-16
Statistics by Class:
Class: -1 Class: 0 Class: 1
Sensitivity 0.000e+00 1.0000000 0.0000
Specificity 9.999e-01 0.0001017 1.0000
Pos Pred Value 0.000e+00 0.5299546 NaN
Neg Pred Value 8.082e-01 1.0000000 0.7217
Prevalence 1.918e-01 0.5299292 0.2783
Detection Rate 0.000e+00 0.5299292 0.0000
Detection Prevalence 4.781e-05 0.9999522 0.0000
Balanced Accuracy 5.000e-01 0.5000509 0.5000
The LDA classifier reaches a test-set accuracy of about 52.99%, which equals the no-information rate (0.5299); the corresponding p-value (0.5028) is not significant, so the classifier does no better than always predicting the majority class. Sensitivity is 1 for class 0 and 0 for classes -1 and 1, while the specificity of class 0 is close to 0 and the balanced accuracy is about 0.5 for every class. The model is therefore biased towards class 0, as the confusion matrix confirms: all but one of the test instances are predicted as sentiment label 0, or no sentiment.
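The reported figures can be reproduced directly from the confusion matrix. The following Python sketch (an illustration; the helper names are hypothetical) recomputes accuracy, sensitivity and specificity from the LDA test-set matrix printed above.

```python
# Rows are predictions, columns are reference labels, in the order -1, 0, 1.
MATRIX = [
    [0,    0,     1],
    [4012, 11084, 5819],
    [0,    0,     0],
]

def accuracy(m):
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

def sensitivity(m, k):
    actual_k = sum(m[i][k] for i in range(len(m)))  # column total
    return m[k][k] / actual_k

def specificity(m, k):
    total = sum(sum(row) for row in m)
    actual_k = sum(m[i][k] for i in range(len(m)))  # column total
    predicted_k = sum(m[k])                         # row total
    true_negatives = total - actual_k - predicted_k + m[k][k]
    return true_negatives / (total - actual_k)

print(round(accuracy(MATRIX), 4))  # 0.5299
print(sensitivity(MATRIX, 1), specificity(MATRIX, 1))
```

For class 0 (index 1) every actual neutral tweet is found, so sensitivity is 1, but only one of the 9832 non-neutral tweets is kept out of class 0, so specificity is about 0.0001. The accuracy of 0.5299 is simply the share of neutral tweets in the test set.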
ROC curve of positive class of LDA model:
The ROC curve of the positive class (1) lies just above the diagonal chance line, giving an area under the curve of 0.533, which is low given that the maximum value is 1 and that 0.5 corresponds to random guessing. This indicates that the model ranks positive tweets above the other tweets only slightly better than chance.
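The area under the ROC curve can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen non-positive one. The Python sketch below computes it by pairwise comparison on hypothetical scores; it illustrates the definition and is not the pROC routine used in the study.

```python
def auc(scores, labels):
    """Share of (positive, other) pairs ranked correctly; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    others = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in positives:
        for o in others:
            if p > o:
                wins += 1.0
            elif p == o:
                wins += 0.5
    return wins / (len(positives) * len(others))

labels = [1, 0, 1, 0]                        # hypothetical true labels
print(auc([0.9, 0.1, 0.8, 0.3], labels))     # 1.0, perfect ranking
print(auc([0.3, 0.3, 0.3, 0.3], labels))     # 0.5, no ranking information
```

A model that assigns nearly the same probability to every tweet therefore ends up close to 0.5, regardless of how high its accuracy appears.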
Random forest model fitting results:
Call:
randomForest(formula = sentiment ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 47.04%
Confusion matrix:
-1 0 1 class.error
-1 0 9251 6 1.0000000000
0 0 25838 6 0.0002321622
1 1 13692 8 0.9994161010
The fitting results of the random forest model show that the forest consists of 500 decision trees and that two variables are tried at each split. The class error is very high, close to 1, for classes -1 and 1, and very low for class 0. The OOB estimate of the error rate is also large, about 47%, indicating that the model is not well fitted to predict all classes accurately (Gholizadeh et al. 2020).
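The OOB error rate and the per-class errors follow from the training confusion matrix by simple arithmetic, as the Python sketch below shows. It reads the rows of the printed matrix as the actual classes, which is consistent with the class.error column.

```python
# Rows are actual classes (-1, 0, 1), columns are OOB predictions (-1, 0, 1).
OOB_MATRIX = [
    [0, 9251, 6],
    [0, 25838, 6],
    [1, 13692, 8],
]

total = sum(sum(row) for row in OOB_MATRIX)
correct = sum(OOB_MATRIX[i][i] for i in range(3))
oob_error = 1 - correct / total
class_errors = [1 - OOB_MATRIX[i][i] / sum(OOB_MATRIX[i]) for i in range(3)]

print(total, round(100 * oob_error, 2))      # 48802 47.04
print([round(e, 6) for e in class_errors])
```

Only 25846 of the 48802 training tweets are classified correctly out of bag, and almost all of them belong to class 0. The overall error of about 47% is therefore close to the combined share of the two minority classes.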
Evaluation results of random forest model in test set:
Confusion Matrix and Statistics
Reference
Prediction -1 0 1
-1 0 0 0
0 4011 11080 5819
1 1 4 1
Overall Statistics
Accuracy : 0.5298
95% CI : (0.523, 0.5366)
No Information Rate : 0.5299
P-Value [Acc > NIR] : 0.5194
Kappa : -2e-04
Mcnemar’s Test P-Value : <2e-16
Statistics by Class:
Class: -1 Class: 0 Class: 1
Sensitivity 0.0000 0.9996391 1.718e-04
Specificity 1.0000 0.0002034 9.997e-01
Pos Pred Value NaN 0.5298900 1.667e-01
Neg Pred Value 0.8082 0.3333333 7.217e-01
Prevalence 0.1918 0.5299292 2.783e-01
Detection Rate 0.0000 0.5297380 4.781e-05
Detection Prevalence 0.0000 0.9997131 2.869e-04
Balanced Accuracy 0.5000 0.4999213 4.999e-01
The random forest classifier reaches a test-set accuracy of about 52.98%, marginally below the no-information rate (0.5299); the corresponding p-value (0.5194) is not significant, so the classifier does no better than always predicting the majority class. Sensitivity is about 0.9996 for class 0, 0 for class -1 and about 0.0002 for class 1, while the specificity of class 0 is close to 0 and the balanced accuracy is about 0.5 for every class. The model is therefore biased towards class 0, as the confusion matrix confirms: all but six of the test instances are predicted as sentiment label 0, or no sentiment.
ROC curve of positive class with random forest model:
The ROC curve of the positive class (1) lies very close to the diagonal chance line, giving an area under the curve of 0.503, which is low given that the maximum value is 1 and that 0.5 corresponds to random guessing. This indicates that the model ranks positive tweets above the other tweets hardly better than chance.
Naïve Bayes model fitting results:
Call:
naive_bayes.formula(formula = sentiment ~ ., data = train)
Laplace smoothing: 0
A priori probabilities:
-1 0 1
0.1896848 0.5295685 0.2807467
t -1 0 1
mean 1.1123310 1.0860613 1.0906696
sd 0.5286162 0.5432999 0.5288540
co -1 0 1
mean 1.0424402 1.0371576 1.0323437
sd 0.4327054 0.4860120 0.4601782
vaccine -1 0 1
mean 0.6981656 0.7604549 0.7366728
sd 1.0832325 1.1256411 1.1155745
moderna -1 0 1
mean 0.5603760 0.6669491 0.6259903
sd 1.0699614 1.1580869 1.1124283
covid -1 0 1
mean 0.3275249 0.4254924 0.3861865
sd 0.9844097 1.1269664 1.0667331
The Gaussian naïve Bayes model shows that the prior probability estimated from the train set is considerably larger for class 0 than for the other classes. For each feature, the mean values vary little between the classes, and the standard deviations are likewise similar across the classes, so the class-conditional distributions of the features overlap heavily (Singh et al. 2019).
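The effect of these overlapping distributions can be illustrated by computing the Gaussian naïve Bayes posterior by hand. The Python sketch below uses the priors, means and standard deviations printed above for the five listed features only (the fitted model may contain further features); the example tweet vector is hypothetical.

```python
from math import exp, pi, sqrt

PRIORS = {-1: 0.1896848, 0: 0.5295685, 1: 0.2807467}
# (mean, sd) per class for the five features shown in the fitting output.
PARAMS = {
    "t":       {-1: (1.1123310, 0.5286162), 0: (1.0860613, 0.5432999), 1: (1.0906696, 0.5288540)},
    "co":      {-1: (1.0424402, 0.4327054), 0: (1.0371576, 0.4860120), 1: (1.0323437, 0.4601782)},
    "vaccine": {-1: (0.6981656, 1.0832325), 0: (0.7604549, 1.1256411), 1: (0.7366728, 1.1155745)},
    "moderna": {-1: (0.5603760, 1.0699614), 0: (0.6669491, 1.1580869), 1: (0.6259903, 1.1124283)},
    "covid":   {-1: (0.3275249, 0.9844097), 0: (0.4254924, 1.1269664), 1: (0.3861865, 1.0667331)},
}

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return exp(-0.5 * z * z) / (sd * sqrt(2 * pi))

def posterior(x):
    """Posterior class probabilities: prior times the product of per-feature densities."""
    joint = {}
    for cls, prior in PRIORS.items():
        p = prior
        for name, value in x.items():
            mean, sd = PARAMS[name][cls]
            p *= normal_pdf(value, mean, sd)
        joint[cls] = p
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}

# Hypothetical tweet: one link token each for 't' and 'co', no other feature word.
post = posterior({"t": 1, "co": 1, "vaccine": 0, "moderna": 0, "covid": 0})
print({cls: round(p, 3) for cls, p in post.items()})
```

Because the per-class densities are almost identical, the likelihood shifts the posterior only slightly and the large prior of class 0 decides the prediction. This is consistent with a classifier that assigns nearly every tweet to the neutral class.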
Naïve Bayes model evaluation results:
Confusion Matrix and Statistics
Reference
Prediction -1 0 1
-1 0 0 0
0 4012 11084 5820
1 0 0 0
Overall Statistics
Accuracy : 0.5299
95% CI : (0.5231, 0.5367)
No Information Rate : 0.5299
P-Value [Acc > NIR] : 0.5028
Kappa : 0
Mcnemar’s Test P-Value : NA
Statistics by Class:
Class: -1 Class: 0 Class: 1
Sensitivity 0.0000 1.0000 0.0000
Specificity 1.0000 0.0000 1.0000
Pos Pred Value NaN 0.5299 NaN
Neg Pred Value 0.8082 NaN 0.7217
Prevalence 0.1918 0.5299 0.2783
Detection Rate 0.0000 0.5299 0.0000
Detection Prevalence 0.0000 1.0000 0.0000
Balanced Accuracy 0.5000 0.5000 0.5000
The naïve Bayes classifier reaches a test-set accuracy of about 52.99%, which equals the no-information rate (0.5299); the corresponding p-value (0.5028) is not significant, so the classifier does no better than always predicting the majority class. Sensitivity is 1 for class 0 and 0 for classes -1 and 1, while the specificity of class 0 is 0 and the balanced accuracy is 0.5 for every class. The model is therefore biased towards class 0, as the confusion matrix confirms: every test instance is predicted as sentiment label 0, or no sentiment.
ROC curve of positive class with naïve bayes model:
The ROC curve of the positive class (1) lies very close to the diagonal chance line, giving an area under the curve of 0.538, which is low given that the maximum value is 1 and that 0.5 corresponds to random guessing. This indicates that the model ranks positive tweets above the other tweets only slightly better than chance.
Comparing the classifiers, the accuracies are practically identical and all three models are biased towards class 0. The random forest is the only model with any correct prediction outside class 0 (a single positive tweet in the test set), and on that narrow basis it can be regarded as the best of the three models for predicting sentiment from text-based features, although the margin is negligible.
Conclusion
In conclusion, sentiment analysis of an appropriate COVID-19 vaccination dataset has been performed with both a lexicon-based and a machine learning based approach. While there is no method to evaluate the performance of the lexicon-based approach, the machine learning based approach appears to be biased towards predicting neutral sentiment. This may be caused by the imbalanced class distribution, as the lexicon-based approach yields considerably more neutral sentiments than positive or negative ones. Thus, for accurate sentiment prediction the lexicon-based approach is recommended over the machine learning approach. However, the lexicon-based approach can mislead when the vaccination tweets contain many terms that are unknown to the sentiment dictionary of the chosen method. In that case the machine learning based approach can be useful, although the correct sentiments of a sufficiently large number of tweets must be known to train the ML model. Moreover, because of limited computational capacity only some basic ML models were applied, with a small number of text dimensions, so better results may be obtained with advanced methods such as neural networks and with a different text-to-numeric feature extractor (such as a hashing vectorizer or a count vectorizer). Under-sampling or over-sampling the feature set with its labels may also improve the performance of the models on the positive and negative classes. As a future extension of this project, these possibilities will be explored in an attempt to build a model with better overall performance.
References
Alamoodi, A.H., Zaidan, B.B., Al-Masawa, M., Taresh, S.M., Noman, S., Ahmaro, I.Y., Garfan, S., Chen, J., Ahmed, M.A., Zaidan, A.A. and Albahri, O.S., 2021. Multi-perspectives systematic review on the applications of sentiment analysis for vaccine hesitancy. Computers in Biology and Medicine, 139, p.104957.
Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M. and Farhan, L., 2021. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of big Data, 8(1), pp.1-74.
Amjad, A., Qaiser, S., Anwar, A. and Ali, R., 2021, September. Analyzing Public Sentiments Regarding COVID-19 Vaccines: A Sentiment Analysis Approach. In 2021 IEEE International Smart Cities Conference (ISC2) (pp. 1-7). IEEE.
Anandarajan, M., Hill, C. and Nolan, T., 2019. Term-document representation. In Practical text analytics (pp. 61-73). Springer, Cham.
Araque, O., Gatti, L. and Kalimeri, K., 2020. MoralStrength: Exploiting a moral lexicon and embedding similarity for moral foundations prediction. Knowledge-based systems, 191, p.105184.
Carrieri, V., Lagravinese, R. and Resce, G., 2021. Predicting vaccine hesitancy from area-level indicators: A machine learning approach. medRxiv.
Chaudhri, A.A., Saranya, S.S. and Dubey, S., 2021. A Survey on Analyzing COVID-19 Vaccines on Twitter Dataset Using Tweepy and Text Blob. Annals of the Romanian Society for Cell Biology, pp.8579-8581.
Cotfas, L.A., Delcea, C., Roxin, I., Ioanăş, C., Gherai, D.S. and Tajariol, F., 2021. The Longest Month: Analyzing COVID-19 Vaccination Opinions Dynamics from Tweets in the Month following the First Vaccine Announcement. IEEE Access, 9, pp.33203-33223.
Du, J., Luo, C., Shegog, R., Bian, J., Cunningham, R.M., Boom, J.A., Poland, G.A., Chen, Y. and Tao, C., 2020. Use of Deep Learning to Analyze Social Media Discussions About the Human Papillomavirus Vaccine. JAMA network open, 3(11), pp.e2022025-e2022025.
Du, J., Luo, C., Wei, Q., Chen, Y. and Tao, C., 2019. Exploring difference in public perceptions on HPV vaccine between gender groups from Twitter using deep learning. arXiv preprint arXiv:1907.03167.
Edo-Osagie, O., De La Iglesia, B., Lake, I. and Edeghere, O., 2020. A scoping review of the use of Twitter for public health research. Computers in biology and medicine, 122, p.103770.
El-Khatib, M.J., Abu-Nasser, B.S. and Abu-Naser, S.S., 2019. Glass classification using artificial neural network.
Fu, R., Tian, Y., Bao, T., Meng, Z. and Shi, P., 2019. Improvement motor imagery EEG classification based on regularized linear discriminant analysis. Journal of medical systems, 43(6), pp.1-13.
Gholizadeh, M., Jamei, M., Ahmadianfar, I. and Pourrajab, R., 2020. Prediction of nanofluids viscosity using random forest (RF) approach. Chemometrics and Intelligent Laboratory Systems, 201, p.104010.
Gonzalez-Dias, P., Lee, E.K., Sorgi, S., de Lima, D.S., Urbanski, A.H., Silveira, E.L. and Nakaya, H.I., 2020. Methods for predicting vaccine immunogenicity and reactogenicity. Human vaccines & immunotherapeutics, 16(2), pp.269-276.
He, F., Deng, Y. and Li, W., 2020. Coronavirus disease 2019: What we know?. Journal of medical virology, 92(7), pp.719-725.
Hussain, A., Tahir, A., Hussain, Z., Sheikh, Z., Gogate, M., Dashtipour, K., Ali, A. and Sheikh, A., 2021. Artificial intelligence–enabled analysis of public attitudes on facebook and twitter toward covid-19 vaccines in the united kingdom and the united states: Observational study. Journal of medical Internet research, 23(4), p.e26627.
Joshi, A., Dai, X., Karimi, S., Sparks, R., Paris, C. and MacIntyre, C.R., 2018, October. Shot or not: Comparison of NLP approaches for vaccination behaviour detection. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task (pp. 43-47).
Jost, J.T., Barberá, P., Bonneau, R., Langer, M., Metzger, M., Nagler, J., Sterling, J. and Tucker, J.A., 2018. How social media facilitates political protest: Information, motivation, and social networks. Political psychology, 39, pp.85-118.
Kabir, A.I., Karim, R., Newaz, S. and Hossain, M.I., 2018. The Power of Social Media Analytics: Text Analytics Based on Sentiment Analysis and Word Clouds on R. Informatica Economica, 22(1).
Kaggle.com. 2022. CoVax Tweet sentiment analysis. [online] Available at: <https://www.kaggle.com/aydanjiwani/covax-tweet-sentiment-analysis/data?select=vaccination_tweets.csv> [Accessed 9 March 2022].
Kiruthika, J.K., Janani, A.P., Sudha, M. and Yawanikha, T., 2021, May. Fine Grained Sentimental Analysis of Social Network Chat Using R. In Journal of Physics: Conference Series (Vol. 1916, No. 1, p. 012210). IOP Publishing.
Kolukisa, B., Hacilar, H., Goy, G., Kus, M., Bakir-Gungor, B., Aral, A. and Gungor, V.C., 2018, December. Evaluation of classification algorithms, linear discriminant analysis and a new hybrid feature selection methodology for the diagnosis of coronary artery disease. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 2232-2238). IEEE.
Liu, R., Shi, Y., Ji, C. and Jia, M., 2019. A survey of sentiment analysis based on transfer learning. IEEE Access, 7, pp.85401-85412.
Mahdianpari, M., Salehi, B., Mohammadimanesh, F., Brisco, B., Mahdavi, S., Amani, M. and Granger, J.E., 2018. Fisher Linear Discriminant Analysis of coherency matrix for wetland classification using PolSAR imagery. Remote Sensing of Environment, 206, pp.300-317.
Mahesh, B., 2020. Machine Learning Algorithms-A Review. International Journal of Science and Research (IJSR), 9, pp.381-386.
Melton, C.A., Olusanya, O.A., Ammar, N. and Shaban-Nejad, A., 2021. Public sentiment analysis and topic modeling regarding COVID-19 vaccines on the Reddit social media platform: A call to action for strengthening vaccine confidence. Journal of Infection and Public Health, 14(10), pp.1505-1512.
Memon, S.A., Tyagi, A., Mortensen, D.R. and Carley, K.M., 2020, October. Characterizing sociolinguistic variation in the competing vaccination communities. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation (pp. 118-129). Springer, Cham.
Mishra, A., Wajid, M.S. and Dugal, U., 2021. A Comprehensive Analysis of Approaches for Sentiment Analysis Using Twitter Data on COVID-19 Vaccines.
Mitra, A., 2020. Sentiment analysis using machine learning approaches (Lexicon based on movie review dataset). Journal of Ubiquitous Computing and Communication Technologies (UCCT), 2(03), pp.145-152.
Naldi, M., 2019. A review of sentiment computation methods with R packages. arXiv preprint arXiv:1901.08319.
Ong, E., Wang, H., Wong, M.U., Seetharaman, M., Valdez, N. and He, Y., 2020. Vaxign-ML: supervised machine learning reverse vaccinology model for improved prediction of bacterial protective antigens. Bioinformatics, 36(10), pp.3185-3191.
Parmar, M., Ambalavanan, A.K., Guan, H., Banerjee, R., Pabla, J. and Devarakonda, M., 2021. COVID-19: Comparative Analysis of Methods for Identifying Articles Related to Therapeutics and Vaccines without Using Labeled Data. arXiv preprint arXiv:2101.02017.
Perdana, R.S. and Pinandito, A., 2018. Combining likes-retweet analysis and Naive Bayes classifier within Twitter for sentiment analysis. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 10(1-8), pp.41-46.
Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché-Buc, F., Fox, E. and Larochelle, H., 2021. Improving reproducibility in machine learning research: a report from the NeurIPS 2019 reproducibility program. Journal of Machine Learning Research, 22.
Prakash, K.B., Imambi, S.S., Ismail, M., Kumar, T.P. and Pawan, Y.N., 2020. Analysis, prediction and evaluation of COVID-19 datasets using machine learning algorithms. International Journal, 8(5).
Rahat, A.M., Kahir, A. and Masum, A.K.M., 2019, November. Comparison of Naive Bayes and SVM Algorithm based on sentiment analysis using review dataset. In 2019 8th International Conference System Modeling and Advancement in Research Trends (SMART) (pp. 266-270). IEEE.
Riyanto, R. and Azis, A., 2021. Application of the Vector Machine Support Method in Twitter Social Media Sentiment Analysis Regarding the Covid-19 Vaccine Issue in Indonesia. Journal of Applied Data Sciences, 2(3), pp.102-108.
Sandaka, G.K. and Gaekwade, B.N., 2021. Sentiment Analysis and Time-series Analysis for the COVID-19 vaccine Tweets.
Sattar, N.S. and Arifuzzaman, S., 2021. COVID-19 Vaccination awareness and aftermath: Public sentiment analysis on Twitter data and vaccinated population prediction in the USA. Applied Sciences, 11(13), p.6128.
Seeja, R.D. and Suresh, A., 2019. Deep learning based skin lesion segmentation and classification of melanoma using support vector machine (SVM). Asian Pacific journal of cancer prevention: APJCP, 20(5), p.1555.
Singh, G., Kumar, B., Gaur, L. and Tyagi, A., 2019, April. Comparison between multinomial and Bernoulli naïve Bayes for text classification. In 2019 International Conference on Automation, Computational and Technology Management (ICACTM) (pp. 593-596). IEEE.
Smith, C.C., Chai, S., Washington, A.R., Lee, S.J., Landoni, E., Field, K., Garness, J., Bixby, L.M., Selitsky, S.R., Parker, J.S. and Savoldo, B., 2019. Machine-learning prediction of tumor antigen immunogenicity in the selection of therapeutic epitopes. Cancer immunology research, 7(10), pp.1591-1604.
Speiser, J.L., Miller, M.E., Tooze, J. and Ip, E., 2019. A comparison of random forest variable selection methods for classification prediction modeling. Expert systems with applications, 134, pp.93-101.
Tomaszewski, T., Morales, A., Lourentzou, I., Caskey, R., Liu, B., Schwartz, A. and Chin, J., 2021. Identifying False Human Papillomavirus (HPV) Vaccine Information and Corresponding Risk Perceptions From Twitter: Advanced Predictive Models. Journal of medical Internet research, 23(9), p.e30451.
Weinzierl, M.A. and Harabagiu, S.M., 2021. Automatic detection of COVID-19 vaccine misinformation with graph link prediction. Journal of biomedical informatics, p.103955.
Wilson, S.L. and Wiysonge, C., 2020. Social media and vaccine hesitancy. BMJ Global Health, 5(10), p.e004206.
Yuchi, W., Gombojav, E., Boldbaatar, B., Galsuren, J., Enkhmaa, S., Beejin, B., Naidan, G., Ochir, C., Legtseg, B., Byambaa, T. and Barn, P., 2019. Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city. Environmental pollution, 245, pp.746-753.
Zhang, L., Wang, S. and Liu, B., 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), p.e1253.