Predicting sentiment and rating of tourist reviews using machine learning

Purpose – As the tourism industry becomes more vital for the success of many economies around the world, the importance of technology in tourism grows daily. Alongside the increasing importance and popularity of tourism, the amount of significant data grows, too. On a daily basis, millions of people write their opinions, suggestions and views about accommodation, services, and much more on various websites. Well-processed and filtered data can provide a lot of useful information that can be used to improve tourists' experiences and to help them decide when selecting a hotel or a restaurant. Thus, the purpose of this study is to explore machine and deep learning models for predicting sentiment and rating from tourist reviews. Design/methodology/approach – This paper used machine learning models such as Naïve Bayes, support vector machines (SVM), convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM) for extracting sentiment and ratings from tourist reviews. These models were trained to classify reviews into positive, negative, or neutral sentiment, and into one to five grades or stars. Data used for training the models were gathered from TripAdvisor, the world's largest travel platform. The models based on multinomial Naïve Bayes (MNB) and SVM were trained using term frequency-inverse document frequency (TF-IDF) for word representations, while deep learning models were trained using global vectors (GloVe) for word representation. The results from testing these models are presented, compared and discussed. Findings – The machine and deep learning models achieved high accuracy in predicting positive, negative, or neutral sentiments and ratings from tourist reviews. The optimal model architecture for both classification tasks was a deep learning model based on BiLSTM. 
The study's results confirmed that deep learning models are more efficient and accurate than machine learning algorithms. Practical implications – The proposed models allow for forecasting the number of tourist arrivals and expenditure, gaining insights into the tourists' profiles, improving overall customer experience, and upgrading marketing strategies. Different service sectors can use the implemented models to get insights into customer satisfaction with the products and services as well as to predict the opinions given a particular context. Originality/value – This study developed and compared different machine learning models for classifying customer reviews as positive, negative, or neutral, as well as predicting ratings with one to five stars based on a TripAdvisor hotel reviews dataset that contains 20,491 unique hotel reviews.


Introduction
Customer experience and opinion are crucial for the enhancement of the tourism industry. Therefore, this industry has already largely adapted to information and communication technologies and the advent of big data (Madyatmadja et al., 2021). Currently, many tourist services are available online such as booking websites (Manosso and Domareski Ruiz, 2021).
Analyzing tourist reviews using machine learning
Since tourists use many websites and social media to leave their personal opinions or comments on a specific place or service, customer reviews have become a significant factor when deciding which hotels or restaurants to visit (Neidhardt et al., 2017). For example, the total number of reviews on TripAdvisor surpassed 884 million in 2020 (Statista, 2020). Information obtained from these reviews is important to other tourists but also to service providers, who can then note key aspects that make their hotel/restaurant good or bad (Sumarsono et al., 2018).
In parallel with the huge increase in the number of online user reviews, there is also a growing need for automated processing of these huge amounts of data because it is impossible for humans to read and analyze all these reviews on their own (Gour et al., 2021). Sentiment analysis is a natural language processing technique used to identify and extract information from data (Collobert and Weston, 2008). In most cases, it means determining whether a review expresses positive or negative sentiment (Barbierato et al., 2021). Although there has been much research on sentiment analysis of tourist reviews over the past decade, most of it is limited to positive/negative classification. Fewer studies include neutral review sentiment in addition to positive/negative (Wadhe and Suratkar, 2020), which is a more demanding task and is included in this study. Adding a neutral class to sentiment classification is important because it gives us additional useful information. A neutral comment is usually an indicator of concern since the customer can easily turn positive or negative. Thus, taking neutral comments into consideration can help increase the number of satisfied customers since it is easier to turn a neutral experience into a positive one than a negative one. Moreover, even fewer studies include rating classification and prediction based on tourist reviews (Harrag et al., 2019), which are also analyzed in this study. Performing rating prediction is useful when one has many customer comments and wants to process them quickly. That way we can easily visualize data (customer satisfaction) and quickly see if drastic changes need to be made. Furthermore, it enables us to sort comments based on their importance. It makes sense to act first on comments rated with the lowest score and work our way up. Predicting specific ratings rather than sentiment is useful when we want more detailed information on the factors that make a customer's rating great or poor. 
Then, we can find common features that make customers' experience poor and improve them in the future but also see what is being done well. In addition, the studies that test the performance of sentiment analysis are rare in the tourism and hospitality domain (Mehraliyev et al., 2022), thus our study also contributes to filling this gap.
In this paper, we have analyzed sentiment and ratings of a specific place or service expressed in customer reviews on TripAdvisor to predict tourist satisfaction. We have conducted sentiment and rating classification using different methods ranging from machine learning algorithms like Naïve Bayes and support vector machines (SVMs) to deep learning methods. Experimental results have shown that deep learning methods based on bidirectional long short-term memory (BiLSTM) outperformed other implemented methods. Based on results from this study, tourist service providers can easily and quickly process a lot of data and get very accurate customer feedback, since user-generated content is regarded as the most influential content in the tourism industry. There are other noteworthy benefits of sentiment analysis, like shaping company marketing strategies, classification of textual data and providing overall better service.
Literature review
Sentiment analysis has been performed with a variety of techniques over the last decade, including lexicon-based, machine-learning-based and deep-learning-based techniques (Jurafsky and Martin, 2000).

JHTI
Lexicon-based sentiment analysis
In lexicon-based sentiment analysis, the sentiment of a sentence is derived from the semantic value and the intensity of each word in it, which requires a pre-defined lexicon to classify positive and negative words (Bagić Babac and Podobnik, 2016). Generally, a text item is treated as a bag of words, and after scoring each word, the sentiment is obtained by a certain pooling operation such as taking an average of individual word scores.
Today many of these lexicon-based approaches are automated, for example with TextBlob (Loria, 2018), a Python library for natural language processing (NLP). Larasati et al. (2020) used TextBlob to obtain sentiment analysis scores from eight tourist websites, which confirmed that most of the visitors' sentiments were positive. In addition, a lexicon-based approach has been used to evaluate consumers' sentiment toward several well-known technological brands (Mostafa, 2013), and sentiment analysis confirmed a generally positive consumer sentiment. Tan and Wu (2011) utilized a lexical database for extracting hotel reviews from Ctrip based on the random walk algorithm for the automated generation of a domain-specific sentiment lexicon. Serna et al. (2016) made use of the WordNet lexical database to obtain emotions from Twitter posts mentioning two holiday periods. In addition, Kang et al. (2012) proposed a replacement senti-lexicon for the sentiment analysis of building reviews based on an improved Naïve Bayes algorithm.
It should be noted that most of the lexicon-based approaches are built upon so-called general-purpose lexicons (Avdić and Bagić Babac, 2021). Bagherzadeh et al. (2021) developed two specific lexicons, namely weighted and manually selected lexicons, which were tested and validated by applying classification accuracy metrics to the TripAdvisor data. Their approach outperformed a SentiWords lexicon-based method and a Naïve Bayes machine learning algorithm in classifying sentiment.
Machine learning approach to sentiment analysis
In the supervised machine learning approach to sentiment analysis in tourism, a variety of classifiers were used (Waghmare and Bhala, 2020). One of the techniques, called Naïve Bayes, was used in research on sentiment analysis of hotel reviews using a multinomial Naïve Bayes (MNB) classifier (Farisi et al., 2018). In that study, the authors provided a solution for classifying customer reviews as positive or negative using an MNB classifier with a bag of words to extract features after data preprocessing, which resulted in an average F1 score of more than 91% in experimental results. Likewise, Afzaal et al. (2019) have shown high accuracy of MNB, that is, MNB correctly classified 88.08% of the aspects of the restaurants' reviews dataset and achieved 90.53% accuracy in the hotels' reviews dataset.
Another well-known technique called support vector machine (SVM) was used in research on a sentiment analysis model for hotel reviews based on supervised learning (Shi and Li, 2011). That paper discusses sentiment analysis using SVM with term frequency-inverse document frequency (TF-IDF) and a bag of words. The experiments showed that TF-IDF was more effective: TF-IDF resulted in an F1 score of 87.2% and a bag of words in 86.4%. In addition, Prameswari et al. (2017) used a similar approach combining TF-IDF with SVM for sentiment analysis of hotel reviews and achieved an accuracy of 78%.
Deep learning approach to sentiment analysis
Although previous techniques have given satisfying results, in recent years deep learning has been increasingly used for sentiment analysis and natural language processing tasks (Faralli et al., 2021). A paper on bidirectional recursive neural networks (RNNs) for token-level labeling with structure (Irsoy and Cardie, 2013) proposed an extension to RNN to carry out labeling tasks at the token level that improves sentiment analysis accuracy. Ramadhani et al. (2021) used long short-term memory (LSTM) architecture to classify tourist reviews and achieved the best accuracy result of 84%. Furthermore, Baziotis et al. (2017) presented an LSTM-based model augmented with an attention mechanism. Using that model, they ranked very high at SemEval-2017 Task 4 "Sentiment Analysis in Twitter". Xu et al. (2019) proposed a method based on BiLSTM and compared it with other sentiment analysis methods like convolutional neural network (CNN), RNN, LSTM and Naïve Bayes. The experiments showed that the proposed BiLSTM gave better results on F1 score and recall, as well as higher accuracy.
In addition to memory-based neural networks, CNNs have also shown satisfactory results in sentiment analysis. Based on a dataset of travel destination reviews, Huang (2021) implemented a sentiment classification model based on a CNN and compared it with several other machine learning models, and the CNN model had the highest accuracy of sentiment classification, reaching 91.6%.
Rating prediction from tourist reviews
While there are many studies of sentiment analysis and rating prediction in the various domains of interest (Harrag et al., 2019), fewer studies provide a framework for analyzing and predicting ratings from tourist reviews based on machine and deep learning.
Leal Gonzalez-Velez Malheiro et al. (2017) used multiple linear regression to calculate an overall rating to estimate the remaining ratings (feature variables) for HotelExpedia and TripAdvisor datasets. While evaluating the performance of five machine learning algorithms, namely Decision Trees, SVMs, Neural Networks, Random Forest and Naïve Bayes, for predicting Google user review ratings on the travel experience, Hossain and Das (2020) showed that SVMs provided better results than other algorithms. In addition, Leal et al. (2018) suggested that the rating prediction can be further advanced by using online processing and post-filtering to improve accuracy in online recommendations.
Overall, it can be concluded that one of the fruitful ways to conduct sentiment and rating analysis and prediction is using natural language processing and machine learning (Abadi et al., 2016). The main goal of this paper is to develop, implement and test machine learning models using data from TripAdvisor, which are capable of classifying customer reviews as positive, negative, or neutral, and predicting customer review ratings with one to five stars.

Research methodology
Data preprocessing
Preprocessing is one of the most important steps when performing any NLP task (Bagić Babac and Podobnik, 2016). Basically, preprocessing means bringing the text into a clean form and making it ready to be fed into the model. When it comes to data preprocessing, there are many useful techniques. In this paper, tokenization is the first preprocessing step. Tokenization means splitting a sentence into a list of words. After tokenization, removing stop words comes as the next step. Stop words are words that are commonly used in any language. In English, for example, stop words are words such as "is", "the", "and", "a", etc. Those words are considered unimportant in natural language processing, so they are removed. Next comes lemmatization, the process of transforming a word into its root or lemma. Examples are "swimming" to "swim", "was" to "be" and "mice" to "mouse". Considering that machines treat lower and upper case differently, all the words are lower-cased for better interpretation. Finally, all punctuation is removed, which contributes to noise reduction and gets rid of useless data.
To perform preprocessing tasks, spaCy was used, an open-source library for advanced natural language processing in Python. It is multilingual, but for this project, only English was needed. After loading data for the English language, spaCy enables us to perform tokenization, lemmatization and stop word removal quite easily. Examples of using spaCy and the explained techniques are shown in Table 1.
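The preprocessing steps described above can be illustrated with a minimal, library-free sketch. It is not the spaCy pipeline used in the study: the stop-word set and the lemma lookup table are tiny hypothetical stand-ins, and for simplicity lower-casing and punctuation removal are applied before tokenization rather than at the end.

```python
import string

# Tiny illustrative resources (assumptions); a real pipeline would rely on a
# library's stop-word list and lemmatizer instead of these hypothetical tables.
STOP_WORDS = {"is", "the", "and", "a", "an", "in", "of", "to", "were"}
LEMMAS = {"swimming": "swim", "was": "be", "mice": "mouse", "rooms": "room"}


def preprocess(text):
    """Lower-case, strip punctuation, tokenize, drop stop words, lemmatize."""
    # Remove punctuation and lower-case before splitting on whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = cleaned.split()
    # Stop-word removal followed by a dictionary-based lemma lookup.
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return [LEMMAS.get(tok, tok) for tok in kept]


print(preprocess("The rooms were clean, and the mice... well, swimming was great!"))
```

A dictionary lookup is only a placeholder for lemmatization; a proper lemmatizer also uses part-of-speech information to choose the right lemma.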

Word representation
Since computers do not understand words or their context, text must be converted into an appropriate, machine-interpretable form. Word embeddings are mathematical representations of words that give similar representations to words that have a similar meaning (Mikolov et al., 2013). In other words, those representations model the semantic meaning of words. Specifically, those representations are vectors that are positioned in space in such a way that vectors closer to each other have more similar semantic meanings.
The word representation used in this research is called global vectors (GloVe) for word representation, as introduced by Pennington et al. (2014). Since then, it has gained popularity due to its good performance and simplicity. GloVe is a log-bilinear model with a weighted least-squares objective trained on a global word-word co-occurrence matrix. That matrix shows how frequently words co-occur with one another in a given corpus. The main idea behind GloVe is that ratios of word-word co-occurrence probabilities encode meaning, as shown with an example in Table 2.
If we investigate the example shown in Table 2, we can see some actual probabilities from a six billion word, i.e. token corpus. The table shows how the word ice co-occurs more frequently with solid, but steam co-occurs more with gas. Furthermore, if we look at the word water, we can see that both ice and steam co-occur with it frequently because it is their shared property.
Another way used for representing words by vectors is Term Frequency Inverse Document Frequency (TF-IDF). It is commonly used in NLP tasks because it takes into consideration the relevance of a word in a document and scales it across all documents in a specific corpus. TF-IDF is calculated by multiplying two metrics, namely term frequency, and inverse document frequency (IDF). Term frequency (TF) is the number of times a specific word (term t) appears in a document (d) divided by the total number of words in a document as shown in Eq. (1) (Jurafsky and Martin, 2000).

Table 2.
Probability and ratio     k = solid      k = gas        k = water      k = fashion
P(k|ice)                  1.9 × 10^-4    6.6 × 10^-5    3.0 × 10^-3    1.7 × 10^-5
P(k|steam)                2.2 × 10^-5    7.8 × 10^-4    2.2 × 10^-3    1.8 × 10^-5
P(k|ice)/P(k|steam)       8.9            8.5 × 10^-2    1.36           0.96

The main difference between these two described vectorization methods is that TF-IDF is easier to use, but GloVe carries semantic meaning and can understand the context better.
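To make the TF-IDF weighting concrete, the following sketch computes it for a toy corpus. Term frequency follows the definition given above (count divided by document length); the inverse document frequency is assumed here to be log(N / df), where N is the number of documents and df the number of documents containing the term, which is one common variant and may differ from the exact form or smoothing used in the study. The function name and the example documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # TF: relative frequency in the document; IDF: log(N / df).
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["great", "hotel", "great", "staff"],
    ["terrible", "hotel"],
    ["great", "location"],
]
for row in tf_idf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

With this variant, a term that occurs in every document receives a weight of zero, which reflects the idea that such a term does not help to distinguish documents.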

Sentiment analysis using machine learning
For the purposes of this study, Naïve Bayes and SVMs were chosen as frequently used machine learning algorithms in data science (Poch Alonso and Bagić Babac, 2022). Naïve Bayes is one of the most commonly used methods in natural language processing tasks. It is based on the Bayes theorem, which calculates the probability of a specific event based on prior knowledge using the following equation:

\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \]

where $P(c \mid x)$ is the posterior probability of a class, $P(c)$ is the prior probability of a class, $P(x)$ is the prior probability of the predictor, and $P(x \mid c)$ is the likelihood, i.e. the conditional probability of the predictor given the class.
SVM is a machine learning algorithm that uses a hyperplane to separate different classes of data. A hyperplane is a subspace whose dimension is always one less than that of its parent space. For example, in a two-dimensional space a hyperplane is a line. The main goal of this algorithm is to find the hyperplane that has the largest distance (margin) to the nearest data points, called support vectors. New data points are classified based on which side of the hyperplane they are located. Furthermore, the larger the margin is, the more confidence we have in determining the data class.
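As an illustration of how the Bayes theorem turns into a text classifier, the sketch below implements a small multinomial Naïve Bayes model from scratch. It is a minimal example under stated assumptions, not the implementation used in the study: it works on raw word counts rather than TF-IDF features, uses add-one (Laplace) smoothing, and the function names and toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_mnb(documents, labels):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {tok for tokens in documents for tok in tokens}
    return class_counts, word_counts, vocabulary


def predict_mnb(model, tokens):
    """Return the class with the highest log-posterior for `tokens`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # log P(c): prior estimated from class frequencies.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocabulary:
                continue  # unseen words carry no evidence in this sketch
            # log P(x|c) with add-one (Laplace) smoothing.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    # P(x) is the same for every class, so it can be dropped from the argmax.
    return max(scores, key=scores.get)


model = train_mnb(
    [["clean", "friendly"], ["dirty", "rude"], ["friendly", "staff"], ["dirty", "room"]],
    ["positive", "negative", "positive", "negative"],
)
print(predict_mnb(model, ["friendly", "room"]))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow for long reviews.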

Sentiment analysis using deep learning
When it comes to deep learning in natural language processing, the first thing that we think of is a recurrent neural network or RNN. The idea behind RNN is to be able to process arbitrary length data while keeping track of its order. Since that approach has some big flaws, such as the inability to capture long-distance semantic connections and the vanishing gradient problem, another type of neural network was used. LSTM is a type of recurrent neural network that overcomes the previously mentioned problems. To do that, LSTMs use four neural network layers instead of one (Sherstinsky, 2020).
The reason why LSTMs work so well is their ability to add or remove information to the cell state. Structures called gates enable this behavior. Gates are different neural networks that consist of a sigmoid layer and a pointwise multiplication operation. The core idea is to forget or update data, because the sigmoid layer squashes values between 0 and 1. That way the network can learn what data is relevant or irrelevant and decide to keep it or forget it. The first gate is called the forget gate and it decides which information to keep or discard. The step is demonstrated in Eq. (5), where $h_{t-1}$ and $x_t$ are the inputs to LSTM, $W_f$ is the weight, and $b_f$ is the bias (Hochreiter and Schmidhuber, 1997).
Next, we want to update the cell state. The second gate, called the input gate, also uses a sigmoid layer and decides which values to update. Afterward, we combine the result of the input gate with the tanh layer to create the update on the cell state (Hochreiter, 1998).
Specifically, to update the cell state, we multiply the old cell state by the forget gate, then add the input gate multiplied by the candidate state $\tilde{C}_t$. The described process is shown in Eq. (8). Finally, we have the output gate. Its job is to calculate the next hidden state. As Eq. (9) shows, we first pass the current input and the previous hidden state through the sigmoid. Then, to get the output, we put the cell state through tanh and multiply it by the previously calculated sigmoid output. As a result, we get the new hidden state shown in Eq. (10). In the end, the new hidden state and the cell state are carried over to the next cell (Hochreiter and Schmidhuber, 1997).
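The gate computations described above can be traced in a few lines of code. The sketch below assumes the standard LSTM formulation, in which every gate applies a sigmoid to an affine function of the concatenated vector [h_{t-1}, x_t] (for example f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f)) and the candidate state uses tanh instead. The function names, the `params` dictionary and the all-zero parameters are illustrative assumptions, not the implementation used in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W @ vector + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; `params` maps a gate name to a (W, b) pair."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(*params["forget"], concat)]
    i_t = [sigmoid(z) for z in affine(*params["input"], concat)]
    c_tilde = [math.tanh(z) for z in affine(*params["candidate"], concat)]
    o_t = [sigmoid(z) for z in affine(*params["output"], concat)]
    # Cell state: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Hidden state: output gate applied to the squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# All-zero parameters make every gate output 0.5 and the candidate 0,
# which keeps the arithmetic easy to check by hand.
hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: (zeros, [0.0] * hidden)
          for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

With these parameters the forget gate halves the old cell state, so the new cell state is [0.5, -1.0]; trained weights would instead let each gate depend on the current word and the previous hidden state.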
The described LSTM model achieves much better results than a traditional RNN (Sherstinsky, 2020), but there is still room for improvement. We have seen that LSTM uses information from the past, meaning that the current state depends on the information before that moment. In order to have more contextual information at every moment, i.e. to increase the amount of information available to the network, we use BiLSTM. BiLSTM consists of two LSTMs, each one of them going in a different direction. The first one goes forward (from past to future) and the second one goes backward (from future to past). That kind of architecture enables us to understand the context much better.
Besides RNNs, CNNs have been commonly used for text classification and sentiment analysis tasks, although they are better known for working with images. The difference here is that one-dimensional (1D) convolution is used instead of the two-dimensional (2D) convolution applied to images. One of the biggest advantages of CNNs is that they are translation invariant. It basically means that once a pattern is learned, a CNN can recognize it later at any other position. Just like 2D convolution, 1D convolution includes many kernels with weights that are learned through the training process. Those kernels are designed to generate an output by looking at a word and its surroundings. That way, since similar words have similar vector representations, convolution will produce a similar value. In practice, those convolutional layers are combined with pooling layers that discard less relevant information (Kuhn and Johnson, 2013).
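The translation invariance mentioned above can be demonstrated with a minimal sketch of a single 1D convolution kernel followed by max pooling. The two-dimensional "embeddings", the kernel weights and the use of global (rather than local) max pooling are assumptions chosen to keep the example small; they do not reproduce the layers of the models in this study.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one kernel over a sequence of word vectors (no padding).

    `sequence` is a list of word vectors; `kernel` is a list of weight
    vectors covering `len(kernel)` consecutive words.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        outputs.append(total)
    return outputs


def global_max_pool(feature_map):
    """Keep only the strongest response of the kernel."""
    return max(feature_map)


# A kernel that responds to the two-word pattern ([1, 0], [0, 1]).
kernel = [[1.0, 0.0], [0.0, 1.0]]
pattern_at_start = [[1, 0], [0, 1], [0, 0], [0, 0]]
pattern_at_end = [[0, 0], [0, 0], [1, 0], [0, 1]]

print(conv1d(pattern_at_start, kernel), conv1d(pattern_at_end, kernel))
print(global_max_pool(conv1d(pattern_at_start, kernel)),
      global_max_pool(conv1d(pattern_at_end, kernel)))
```

The kernel produces its peak response at a different position in each sequence, but after pooling both sequences yield the same value, which is why the position of the pattern does not matter to later layers.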

Model architectures for machine learning
For conducting sentiment analysis, a few different methods and architectures were proposed. First, we implemented two machine learning algorithms, namely MNB and SVM. In these machine learning approaches, we used TF-IDF for word representations. After that, we implemented deep learning models using GloVe for word representations. Our first deep learning model is based on a 1D CNN, i.e. it consists of three 1D convolutional layers combined with dropout and max-pooling layers, followed by three linear layers and a softmax at the end. The described architecture is shown in Figure 1.
Furthermore, we implemented a model architecture that consists of two stacked BiLSTMs followed by three linear layers with a softmax function at the end. The model's architecture is shown in Figure 2. For this model, word representations are provided by GloVe, thus the word embeddings are used as the inputs of BiLSTM. After the word embeddings pass through two BiLSTM layers for text feature extraction, the resulting vectors are used as inputs to three linear neural network layers with ReLU activation functions that perform text classification. Lastly, the output is passed through the softmax function to convert the numerical output into the range [0, 1], representing the probabilities of each class.
In addition, another model has the same architecture as the one shown in Figure 2, but with unidirectional LSTMs instead of BiLSTMs.
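The final softmax step of these architectures can be sketched as follows. The input scores are hypothetical outputs of the last linear layer for three sentiment classes; subtracting the maximum before exponentiation is a common numerical-stability measure and an assumption of this sketch, not a detail reported in the study.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the negative, neutral and positive classes.
probs = softmax([0.2, 1.1, 3.0])
print([round(p, 3) for p in probs])
```

The predicted class is the one with the highest probability, here the third (positive) class.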

Experimental results
For the purpose of training the models to achieve good performance in practice, the dataset has to be convincing (Cvitanović and Bagić Babac, 2022). Having that in mind, data was extracted from TripAdvisor, the world's largest travel platform that today has over 860 million reviews and opinions (Alam et al., 2016a, b). This study utilized a dataset called TripAdvisor Hotel Reviews that contains 20,491 unique hotel reviews graded from one to five stars by guests (Alam et al., 2016a, b). For training purposes, the dataset was split into three parts, namely training, evaluation and testing subsets, in the ratio of 70% for training, 10% for evaluation and 20% for testing (Kuhn and Johnson, 2013).
Since the dataset consists of reviews and their scores, which are grades from one to five, the machine learning models are first trained to predict the exact grade based on the review text. The MNB algorithm resulted in 46% accuracy on the test data, while SVM managed to outperform Naïve Bayes and achieved an accuracy of 55%.
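The 70/10/20 split can be sketched as below. Whether the original split was shuffled, stratified by rating, or seeded is not stated, so the shuffling, the seed and the function name are assumptions made for illustration.

```python
import random


def split_dataset(items, train=0.7, evaluation=0.1, seed=0):
    """Shuffle and split into training, evaluation and testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_eval = int(len(shuffled) * evaluation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_eval],
            shuffled[n_train + n_eval:])  # the remainder is the test subset


# Integer indices stand in for the 20,491 reviews.
train_set, eval_set, test_set = split_dataset(range(20491))
print(len(train_set), len(eval_set), len(test_set))
```

For 20,491 reviews this yields 14,343 training, 2,049 evaluation and 4,099 testing examples.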
After the machine learning algorithms, deep learning models were trained and tested. Given that grid search is quite exhaustive and time-consuming, a random search was used in the process of setting hyperparameters for training (Vrigazova, 2021). Furthermore, an early stopping mechanism was used and the model with the smallest loss on the evaluation data was saved (Marrese-Taylor et al., 2014). Also, to prevent the model from overfitting, we used the dropout mechanism. In addition, gradient clipping, a technique that prevents the exploding gradients problem, was used too. Throughout all training runs, the batch size was fixed at 16 examples. Considering that the problem of predicting the score from the review text can be treated as a classification task, a loss function called categorical cross-entropy was implemented. Categorical cross-entropy is one of the most popular loss functions when it comes to multi-class classification (Neidhardt et al., 2017). It is shown in Eq. (11), where $\hat{y}_i$ is the i-th value in the model prediction and $y_i$ is the true label value.
Finally, an optimization algorithm for stochastic gradient descent called Adam was chosen for model training.
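Two of the training ingredients mentioned above, the categorical cross-entropy loss and gradient clipping, can be sketched in a few lines. The loss is written in its usual form, the negative sum of y_i log(ŷ_i) over the classes; clipping by the L2 norm and the threshold value are assumptions, since the study does not report which clipping variant or threshold was used.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss -sum_i y_i * log(y_hat_i) for a one-hot label and predicted probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


def clip_by_norm(gradient, max_norm):
    """Rescale a gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    return [g * max_norm / norm for g in gradient]


# Hypothetical prediction for a review whose true rating is four stars.
y_true = [0, 0, 0, 1, 0]
y_pred = [0.05, 0.05, 0.10, 0.60, 0.20]
print(round(categorical_cross_entropy(y_true, y_pred), 4))
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

Because the label is one-hot, the loss reduces to the negative logarithm of the probability assigned to the true class, so confident correct predictions give a loss close to zero.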
The highest accuracy that the 1D CNN managed to achieve after conducting a random search for setting hyperparameters was 62%. Furthermore, the stacked LSTM model performed better than the CNN-based model, as expected. The best model with LSTM architecture managed to achieve 66% accuracy. Finally, a stacked BiLSTM model outperformed other models by achieving 72% accuracy. Table 3 shows experimental results of the BiLSTM-based model on test data for specific hyperparameter combinations. Figure 3 shows losses on evaluation and training data during the training of the best-performing model.
Finally, we can compare the experimental results of all the above classification methods. An overview of those results is shown in Table 4. The "Rating task" column summarizes the previously explained results, i.e. classifying reviews into five classes (or grades) from one to five, and the "Sentiment task" column shows the results from classifying reviews into three classes representing positive, negative and neutral customer experience. During the process of calculating the scores, a review is considered positive if it has a score greater than 3, neutral if the score is 3, and negative if the score is less than 3. Table 5 shows the results for different hyperparameter combinations of the BiLSTM model proposed for sentiment classification, while Figure 4 shows how evaluation and training loss behave during the training process of the model with the highest accuracy.
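The mapping from star ratings to the three sentiment classes described above is simple enough to state directly in code; the function name is hypothetical.

```python
def rating_to_sentiment(stars):
    """Map a one-to-five star rating to a three-class sentiment label."""
    if stars > 3:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"


print([rating_to_sentiment(s) for s in range(1, 6)])
```

Applying this mapping to the rating labels turns the five-class rating dataset into the three-class sentiment dataset without any additional annotation.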
From the results presented in this section, it can be concluded that deep learning models delivered better overall performance than the existing classical machine learning approaches. It has been shown that by leveraging the BiLSTM-based model architecture with touristic opinion data, higher accuracy in predictions may be obtained. This model's high accuracy and efficiency can help further improve the hotel or tourism industry in better understanding
Although deep learning models outperform other machine learning models in these multiclass prediction tasks, it can also be noticed that the results from the sentiment task seem satisfactory in certain settings, e.g. 80% for SVM, given that simpler pre-processing and less memory were used. Thus, during the decision-making process in a particular setting or environment, one can balance between achieving higher efficiency and accuracy versus utilizing fewer computational resources. However, for a more complex task such as rating prediction, deep learning models provide significantly better accuracy compared to some other models that do not even provide adequate accuracy (e.g. Naïve Bayes with below 50% accuracy).
In the comparison of our results to the results of others (Gitto and Mancuso, 2017; Mehta et al., 2021; Wang et al., 2022), it can be noted that there are differences in the size, quality, and purpose of a particular dataset and different uses and implementations of sentiment analysis. In addition, there are various approaches to calculate whether the sentiment is positive, negative, or neutral, e.g. a study that explored the cruise experiences (Wang et al., 2022) considered a comment as negative "if a comment's positive score was less than or equal to two times the absolute value of its negative score". Moreover, a recent survey on the sentiment analysis in hospitality and tourism (Mehraliyev et al., 2022) has reported that the studies that test the performance of sentiment analysis are rare, thus our results contribute to filling this gap.

Conclusions
Given the vast amount of data on people's individual opinions, there is a need to develop and improve existing sentiment analysis tools. These tools not only serve the individuals as a recommender on how to optimize their choices of services to use, but also to decision-makers in improving the quality of their services. The long-term implications of the knowledge gained by these sentiment tools may influence tourism development and the engagement of tourist stakeholders. Our contribution in the form of proposed models can indicate a plausible Analyzing tourist reviews using machine learning further direction for developing more robust and accurate models for sentiment and rating classification. More specifically, this study provides an insight into how to apply machine and deep learning models for sentiment analysis on tourist reviews. It showed that the BiLSTM model outperformed in both sentiment and rating classification tasks. Specifically, in our BiLSTM model, data were first passed through two stacked BiLSTMs whose job was to gather contextual information followed by three linear layers that perform classification. Models were trained to classify reviews first into five and later into three classes with GloVe used for word representation. As result, the best performing model for five classes achieved 72% accuracy while the best model for three classes surpassed accuracy by 89%. For methodological comparison, other models based on machine learning called Naı €ve Bayes and SVMs were implemented as well as other deep learning models like 1D convolution and LSTM. Experimental results have shown that the deep learning model based on BiLSTM achieved the best results in both tasks.
Our results confirmed that deep neural network algorithms are more accurate than traditional machine learning algorithms (Waghmare and Bhala, 2020). Deep neural networks in general need less time because they require less human intervention and perform automated feature extraction. However, to reach adequate accuracy and performance, they require a larger amount of data than traditional machine learning algorithms, and their training costs are also high.

Theoretical implications
In recent research based on customer comments in the hospitality and tourism field, three themes were identified as most relevant: behavior, social media, and marketing related to user-generated data (Mukhopadhyay et al., 2022). In addition to sentiment analysis, various text mining techniques are used to gain insights from these user-generated data, depending on the particular research goal. For example, a recent study of online reviews from TripAdvisor applied sentiment analysis, clustering, topic modeling, and machine learning algorithms for real-time classification (Gour et al., 2021). Furthermore, sentiment variables were investigated "not only as independent but also as dependent variables" (Mehraliyev et al., 2022). For instance, Kim and Han (2022) used regression analysis to understand the impact of the length of stay at hotels on online reviews.
A systematic review of sentiment analysis literature in hospitality and tourism from methodological and thematic perspectives confirmed that "testing the performance of sentiment analysis was uncommon" (Mehraliyev et al., 2022), and our study contributes to filling this gap by providing performance results of sentiment analysis based on different machine and deep learning models. While most studies use sentiment analysis as a tool to find insights into customer opinions, our study provides a methodological framework to create and customize sentiment analysis models based on machine and deep learning approaches.
Sentiment analysis theorists might find fruitful insights in the methodological aspects of this study, for instance, when investigating an appropriate model architecture for a particular purpose and domain, or when fine-tuning the parameters of machine and deep learning models. This study provides detailed methodological insight into several different models, their architectures, and their complete training and testing processes. Furthermore, different word representation models, namely TF-IDF and GloVe, are implemented and compared. It can also be noted that our approach goes beyond the hospitality and tourism domain.
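As a reminder of what the TF-IDF representation computes, the sketch below implements one common textbook variant: term frequency multiplied by the logarithm of the inverse document frequency. The function name and the toy reviews are hypothetical, and the exact weighting variant used in the study is not stated here; library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Two toy reviews, for illustration only.
reviews = [["clean", "room"], ["dirty", "room"]]
for review_weights in tf_idf(reviews):
    print(review_weights)
```

A term that occurs in every document, such as "room" in the toy example, receives zero weight under this variant, whereas terms that distinguish one review from another receive positive weight. GloVe, in contrast, assigns each word a dense pretrained vector that does not depend on the review collection at hand.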
Another direction for exploring optimal models for rating prediction is the use of other features as input to the model. Additional features may also be learned from the text, e.g. by content analysis (Wang et al., 2022) or topic analysis (Gour et al., 2021). Furthermore, the insights derived from these models may provide a basis for understanding tourist behavior patterns and for setting up a theoretical framework for shaping public opinion, i.e. distilling the variables that contribute to forming opinions. Such a framework can contribute to establishing short-, mid-, or long-term marketing and other relevant strategies for companies and organizations in a particular tourism context.

Practical implications
State-of-the-art natural language processing technologies enable high-quality, fast, and efficient text analysis that supports a holistic and deep understanding of user experiences on various topics. As tourists often use the online reviews of others as a primary source of information (Kim and Han, 2022), the insights from analyses of online reviews are a valuable source of knowledge for many stakeholders in tourism. Sentiment analysis techniques in the tourism industry can be used for forecasting tourist expenditure and the number of arrivals, and for gaining insights into tourists' profiles. In addition, they can process large amounts of data and give valuable feedback, since that feedback is derived from user-generated content.
There are other meaningful benefits of sentiment analysis in tourism, such as using the resulting information to improve the overall customer experience and upgrade marketing strategies. Highly accurate sentiment analysis results can thus serve as a resourceful and purposeful crowd intelligence summary that helps decision-makers and other stakeholders in tourism with strategy planning, decision making, and marketing activities. Different service sectors can use the implemented models in their own domains of interest, both to recognize satisfaction with their products and services and to predict opinions on upcoming challenges.

Limitations and future research
There are several limitations regarding sentiment analysis research on user-generated data. First, in our study, we assumed that all reviews in our dataset were appropriately credible. However, a growing field of research investigates possible fake information, warning both users and service providers of the continuous need to update and analyze the factors that can influence the perceived credibility and quality of online information (Reyes-Menendez et al., 2019). It has been shown that a deep learning approach such as the one presented in this study achieves promising results even in fake information detection (Cvitanović and Bagić Babac, 2022). In addition, aspect-based sentiment classification methods have shown promising results in suppressing the noise within this kind of data (Afzaal et al., 2019).
Another limitation is the use of a single data source, so future research can be directed toward enlarging the dataset as well as increasing the number of data sources. Data from other social media have already proven beneficial for gaining valuable insights from tourists (Serna et al., 2016). In addition, exploring more recent and richer datasets should enable a comparison with the current data and results. Different data types from various domains, such as transport, environment, and weather, may be combined with sentiment scores to investigate unexplored patterns. Furthermore, review classification using natural language processing could be extended to handle multilingual reviews.
Regarding future work, possible improvements can be realized using transformers, a new architecture based on the attention mechanism and transfer learning (Zhang et al., 2018). Transfer learning enables more efficient use of computational resources and overcomes limitations such as the need for large, labeled datasets to train accurate models like recurrent or convolutional neural networks. With transfer learning, training is less time-consuming and less expensive, and the trained model can be easily adapted to a new task, e.g. one with a different set of labels (Vaswani et al., 2017).