The role of user-generated content in tourism decision-making: an exemplary study of Andalusia, Spain

Purpose – This research proposes to organise and distil this massive amount of data, making it easier to understand. Using data mining, machine learning techniques and visual approaches, researchers and managers can extract valuable insights (on guests' preferences) and convert them into strategic thinking based on exploration and predictive analysis. Consequently, this research aims to assist hotel managers in making informed decisions, thus improving the overall guest experience and increasing competitiveness.
Design/methodology/approach – This research employs natural language processing techniques, data visualisation proposals and machine learning methodologies to analyse unstructured guest service experience content. In particular, this research (1) applies data mining to evaluate the role and significance of critical terms and semantic structures in hotel assessments; (2) identifies salient tokens to depict guests' narratives based on term frequency and the information quantity they convey; and (3) tackles the challenge of managing extensive document repositories through automated identification of latent topics in reviews by using machine learning methods for semantic grouping and pattern visualisation.
Findings – This study's findings (1) aim to identify critical features and topics that guests highlight during their hotel stays, (2) visually explore the relationships between these features and differences among diverse types of travellers through online hotel reviews and (3) determine predictive power. Their implications are crucial for the hospitality domain, as they provide real-time insights into guests' perceptions and business performance and are essential for making informed decisions and staying competitive.
Originality/value – This research seeks to minimise the cognitive processing costs of the enormous amount of content published by the user through a better organisation of hotel service reviews and their visualisation. Likewise, this research aims to propose a methodology and method available to tourism organisations to obtain truly useable knowledge in the design of the hotel offer and its value propositions.


Introduction
Tourism in Andalusia (Spain) is relevant due to its impact on regional production and employment. Tourism is, in fact, its first industry, representing 6.5% of its GDP in 2021 (Δ 35.8% and Δ 49.8% of tourist flow compared to 2020). Furthermore, according to the Hotel Occupancy Survey (National Statistics Institute, 2022), Andalusia recorded 5,682,276 hotel overnight stays in September 2022 (1,961,146 travellers), notably advancing from 4,696,783 overnight stays in September 2021 (1,579,073 travellers). The key to tourism in Andalusia is also contextualised in the growing use of information and communication technologies (ICTs) to consult, book or purchase tourism services, mainly hotel services. Five years ago, in 2017, 61.2% of tourists reported using ICT to consult, make a reservation or purchase services during their trip to Andalusia, up 4% compared to the previous year (Balance of the Year of Tourism in Andalusia, 2017). Therefore, its use represents a critical opportunity for the sector. For instance, ICTs increase the possibility of cost reduction by eliminating the intermediation relationship and the availability of new (and ubiquitous) communication channels between tourists and organisations. Furthermore, ICTs transmit diversity, wealth, quality, safety, complementarity and differentiation. According to the Andalusia Horizon 2020 General Plan for Sustainable Tourism in Andalusia, this development has changed how tourists make their decisions due to the exchange of opinions and experiences through applications, social networks and other spaces on the Internet.
In this sense, the hospitality domain is evolving towards other models based on the publication and consultation of user-generated content (UGC) in online booking systems (Raguseo et al., 2017; Sparks et al., 2016; Lyu et al., 2022). Guests seek advice before booking a hotel. They consult reviews published on information-mediation platforms (e.g. Booking, TripAdvisor, Expedia and Yelp, among others) and assess the ratings of other guests about their stays in hotel establishments. Online reviews are spontaneous, enlightening and even passionate, easily accessible from anywhere and at any time (Alarcón-Urbistondo et al., 2023; Guo et al., 2016; Zhu et al., 2020). Reviews are memories or cognitive reconstructions of a trip or stay that appear to reduce the potential risk of purchase noticed by other users (Sparks et al., 2016). Although shared content that recreates guest experiences can be distorted, the information is perceived as credible.
Consequently, our research seeks to extend the published research on information-mediation platforms by applying an interpretation framework based on the discipline of relationship marketing. Our analysis proposes to identify critical terms and topics inferred from the unstructured content that guests generate about the hotel services they use. Therefore, our methodology uses a multifaceted approach, merging both qualitative and quantitative research paradigms. It uses an automated tracking system for data collection, where hotel, review and guest data are recorded from specified locations. Our analysis of these data includes structured and unstructured elements, with an emphasis on user-generated reviews as a key source of information.
Our study delves into the complex analytical environment of tourism in Andalusia, focussing primarily on the intricate interplay of features that affect the hospitality industry and guest satisfaction by understanding needs, providing reliable service and building trust (relationship quality, hereinafter, RQ; cf. classic texts such as those of Gundlach et al., 1995; Morgan and Hunt, 1994; Parasuraman and Grewal, 2000). Our study aims to examine the role of ICTs in shaping consumer behaviour and their consequent influence on hotel services and demand. Our research further investigates the correlation between variables associated with guests, e.g. traveller type and length of stay, and the narratives produced by the guests, with a substantial reliance on data visualisation as an analytical tool.
Moreover, our research employs the graphical visualisation of results. Data visualisation is a fundamental tool for analysing and understanding guest reviews. It enables users to effectively and quickly interpret data, helping decision-making and identifying patterns and trends. Visual representation of data is also more accessible to the human brain. Specialised visualisation tools are necessary to effectively explore and understand data, given that traditional data analysis methods can be inefficient and impractical for handling large datasets. However, there are potential limitations in using data visualisation. In particular, the reliance on data visualisation may lead to the oversimplification of complex data, resulting in biased or incomplete conclusions. To avoid these mistakes, our research starts by defining its objectives before collecting and preparing its data for visualisation. In addition, our approach stresses the importance of choosing the appropriate chart type for the data and message and ensures that the visualisation is legible and easy to understand.
In summary, our research seeks (1) to minimise the cognitive processing costs of the enormous amount of content published by the user through a better organisation of hotel service reviews and their visualisation, and (2) to propose a methodology and method available to tourism organisations to obtain truly useable knowledge in the design of the hotel offer and its value propositions. Our selected statistical analysis also includes various algorithms to identify the semantic structures behind UGC. These include Scattertext for highlighting the most salient terms, word shift graphs for visualising text comparisons and Non-Negative Matrix Factorisation (NMF) for topic modelling. Finally, our research validates results with various statistical measures, including the Matthews correlation coefficient (MCC) and SHapley Additive exPlanations (SHAP). The Research Method section provides details of the data gathering, data mining and findings on hotel stays. Finally, the Discussion section presents future research lines and theoretical and managerial implications.

Theoretical framework
The RQ refers to the general assessment of the strength of a relationship and is often associated with relationship satisfaction, trust and commitment. RQ is conceived as a suitable approach to explain and predict a relationship's success and is based here on commitment theory in business relationships (cf. Gundlach et al., 1995; Morgan and Hunt, 1994; Parasuraman and Grewal, 2000). In the context of the hotel industry, RQ is defined as the extent to which a hospitality relationship is able to fulfil the needs of guests; that is, high relational quality can lead to increased guest satisfaction as it often involves understanding and meeting the guest's needs and expectations, providing reliable and consistent service, and building a sense of commitment between the hotel and the guest (Mody et al., 2019). In this regard, commitment theory posits that the more committed a guest is to a hotel, the more likely they are to continue doing business with the hotel, for example, repeat bookings, positive reviews related to hospitality features and word-of-mouth recommendations.
In particular, in a relational context, when comparing the results of hospitality services with guests' expectations, satisfaction could be conceptualised as the guest's sensation of pleasure or disappointment resulting from a stay. Satisfaction, which is related to the service provider's performance, is a key measure of a hotel's effectiveness at outperforming other hospitality services. According to a cognitive approach, satisfaction is also formulated as the affective response to the congruence between the result and the standard of comparison (the disconfirmation of expectations model; Oliver, 1997, 2010). Following the atmospheric proposal, our study could identify intangible features (accommodation ambience, among others) and tangible features (accommodation amenities, for example) that influence guests' pleasure, arousal and willingness to return. For example, Belarmino et al. (2019) conclude that room facilities are a dominant topic for hotel guests. Amenities attract guests who prefer the feeling of being home instead of staying in a conventional hotel. "Hotel amenities play a significant role in guests' decision-making processes and service experience [...], including satisfaction" (Yu et al., 2022, p. 3168), which can even justify higher prices.
Thus, the expectations of guests, the characteristics-based evaluation, the emotional evaluation and sensory attributes play a crucial role in generating satisfaction (Bagozzi et al., 1999; Baker et al., 1992; Lazos and Steenkamp, 2005; Mudie et al., 2003; Oliver, 2010; Rodríguez and San Martín, 2008; Yu and Dean, 2001). Our research on UGC helps hosts and hotel managers identify guests' motives and provide better services to improve guests' experience, hospitality service reputation, willingness to return and even willingness to accept a higher price. Firstly, the literature on hotel guest satisfaction and dissatisfaction highlights key findings with important implications for hotel managers (e.g. Sann et al., 2022). Accessibility and standardisation are critical factors for short stays, i.e. being within walking distance of major attractions such as the city centre and transportation hubs and having lower risks associated with the hotel. The proximity allows guests to save time and effort on the commute, thus enhancing their overall travel experience. Following Sánchez-Franco and Aramendia-Muneta (2023), "guests emphasise drivers such as location and accessibility (e.g. walking distance from major attractions such as the city centre, the transportation hub, or the beach), lower risks-through standardisation, regulations, and reputation (...)".
A hotel located near a metro station provides guests with the convenience of exploring different parts of the city with ease. Easy access to transportation options (for instance, public transport or car parking) thus becomes an essential factor in a guest's decision-making process. Moreover, location can be a determinant of perceived safety and security: hotels in more central or high-footfall areas are often seen as safer, a factor of particular importance for families. Although service differentiation is the key to standing out in the competitive hospitality market, location greatly improves the value of these differentiated services. "Location and accessibility (...) help customers find the hotel easily, provide a good view of the surroundings and save time for customers seeking to visit nearby places of interest" (Xu and Li, 2016, p. 61; cf. also Sim et al., 2006). Accessibility is, therefore, a crucial factor in a guest's decision-making process, particularly when choosing a hotel for short or family trips (Poon and Huang, 2017). The proximity to major attractions or hot spots not only allows guests to explore the surrounding area, but, when combined with differentiated services, also provides a comprehensive and convenient travel experience.
Moreover, standardisation precisely refers to the consistent quality and services provided by hotels, e.g. in-room services, airport shuttle services and free parking. Traditional hotels generally operate within established frameworks to deliver uniform service quality. In contrast, peer-to-peer (P2P) accommodation platforms such as Airbnb can provide distinctive and tailored guest experiences, yet frequently lack the standardisation and oversight characteristic of the conventional hospitality sector. This potential absence of conventions and controls may engender risks for hosts. The personalised experiences offered by P2P accommodation platforms contrast with the regulated standards of traditional hotels. Whereas Airbnb listings provide unique, tailored stays for guests, they lack the regulations and standardisation that hotels reliably offer. Hotels adhere to safety standards and regulations, ensuring guests have a consistent, risk-averse experience. In sum, while P2P platforms prioritise uniqueness, hotels focus on consistent standards which provide guests with security and peace of mind.
Hence, standardised facilities and services offer a sense of familiarity and assurance to guests, facilitating their exploration of the surroundings. For example, knowing that a hotel provides a reliable shuttle service to key points of interest can encourage guests to explore the local area. Similarly, a hotel that offers comprehensive business services, for example, meeting rooms or conference facilities, can be particularly valuable for business travellers. If this hotel is also located in the city centre or near major business districts, it can provide added convenience for these guests, thus improving the overall value proposition. In contrast, if it is also located near a beach or a natural park, the overall relaxation experience for the guests can be significantly enhanced due to the serene surroundings. Therefore, the value of these differentiated services can be significantly improved when combined with a strategic location.
Secondly, hotels design their services to create unique value propositions for their guests. In addition, providing a local editorial perspective, local insider tips and practical information on neighbourhoods can facilitate an authentic local experience for guests. Furthermore, guests place a high value on property characteristics (e.g. open spaces, increased safety, car parking or gym services) and a variety of amenities and services, e.g. wellness facilities, dining options, business services amenities, Internet access, minibars, TV streaming services, hair dryers or coffee makers (Radojevic et al., 2015). Hotels that offer car parking can also be perceived as more secure and provide additional convenience for guests.
Thirdly, hotels differentiate their services to ensure survival (Benítez-Aurioles, 2019). As Sánchez-Franco and Aramendia-Muneta (2023) suggest, hotels cultivate a traditional delivery-focused paradigm that stresses service quality and facilities (Chu and Choi, 2000; Dann et al., 2019; Kandampully and Suhartanto, 2000). Hotels offer housekeeping services, guest loyalty programmes (related to money saving) and facilities to make them less vulnerable to competition (see also Festila and Müller, 2017; Young et al., 2017). Furthermore, hotels prioritise the quality of interactions between guests and employees, which is crucial for guest satisfaction (Osman et al., 2019; Parasuraman et al., 1988). As Xu and Li (2016, p. 63) conclude, "staff performance seemed to be among the most influential factors in determining customer satisfaction [...] that can strengthen the customer's relationship with the hotels". The staff is easily accessible and encourages engaging guest interaction. In addition, hotels create opportunities for social and recreational activities or events that encourage guests to engage with each other and build a sense of community. On the one hand, quality interactions between guests and employees can significantly improve guest satisfaction. On the other hand, hotels create a more pleasant and satisfying experience for their guests, leading to greater customer loyalty. In sum, a hotel employee who goes above and beyond to assist a guest, such as providing personalised recommendations for local attractions or promptly addressing any issues, can create a positive impression on the guest and lead to increased satisfaction and a higher likelihood of the guest returning or recommending the hotel to others.
Accordingly, our research proposes precisely a service-orientated method to explore the latent qualities extracted from the experience narrated by the tourist (e.g. location of the establishment, service quality and perceived value, staff, sleep and comfort, amenities and related services, cleanliness, hotel atmosphere, among others). The emergence of UGC precisely reshapes how customers share their experiences and make decisions, creating an untapped source of data that offers nuanced insights into guest preferences and expectations. Following Sánchez-Franco et al. (2019), "UGC plays an increasingly important role in consumer attitudes and purchase intentions, particularly in relation to travel services (Litvin et al., 2008; Liu and Park, 2015; Marchiori and Cantoni, 2015; Wu et al., 2017)". Our study, in this sense, emphasises that the spontaneous and multifaceted nature of UGC can supplement traditional survey-based research, revealing hidden dimensions of the guest experience that might otherwise have remained undiscovered. Furthermore, our research assesses the fulfilment of expectations, predictions, goals and desires in the context of the relationship. In this regard, UGC tends to be more empathetic than other one-dimensional metrics. UGC highlights the utilitarian, affective, social and symbolic aspects of consumer experiences in their natural setting, without interference from researchers (cf. Sánchez-Franco et al., 2016). In this context, the media systems dependency theory suggests that consumers who rely heavily on a particular medium (e.g. community-based online services) are more susceptible to attitudinal and behavioural changes stemming from that community (cf. Ball-Rokeach, 1985). Therefore, this analysis can provide a more precise, dynamic and detailed understanding of the determinants of guest satisfaction, allowing hotels to refine and calibrate their services and formulate value propositions that precisely meet guest needs.
As Zhu et al. (2020) conclude, reviews are multifaceted and incorporate richer content that a single scalar value cannot fully capture (Archak et al., 2011). Structured questionnaires sometimes introduce various response biases resulting from the question's wording. A J-shaped distribution characterises the scores, which tend to be positive (Zervas et al., 2021), perhaps motivated by fear of retaliation (Dolnicar, 2018). Additionally, traditional questionnaire-based research requires an arduous effort for data acquisition. The questionnaires are not regularly updated compared to the dynamism of the tourism sector and require excessive input from respondents. In contrast, data from information-mediation platforms are accessible to the researcher. Free and natural opinions are labelled thematically, spatially and temporally. They tend to contain less bias due to influence. And they ultimately offer a vital and abundant opportunity for scientific studies on tourism once the researcher removes the noise they contain. In sum, UGC presents a way to disseminate information and inform travel decision-making. By sharing their travel experiences through content, images and videos, customers enhance the amount of data available for future potential travellers, encompassing new markets, topics and sensitive matters. The up-to-date and easily accessible customer feedback provided through UGC serves as a modern form of electronic word-of-mouth (eWOM) (Mitsopoulou et al., 2023). Xu et al. (2023) demonstrate the indirect influence of UGC on travellers' intentions to revisit a destination, as well as on word-of-mouth (WOM) transmission through perceived image and satisfaction experienced with that destination. Furthermore, empirical findings confirm the importance of considering the UGC as a key contemporary source for destination image formation. To sum up, the influence of UGC on consumers' attitudes towards a brand and their purchase intention has been widely recognised in the recent literature (Chevalier and Mayzlin, 2006; Godes and Mayzlin, 2004; Martins Gonçalves et al., 2018).
By applying data mining and machine learning techniques, our study cleans and extracts the essential elements of the original text (topics and thematic communities) and generates a simplified and understandable version of the text concerning the guest profile and influencing the enquiry, booking, purchase and repurchase of tourism services (cf. Litvin et al., 2008; Veloso and Gomez-Suarez, 2023; Vermeulen and Seegers, 2009). Furthermore, as Sánchez-Franco et al. (2022) point out, user-created online reviews play a crucial role in building hotels' reputation (through eWOM, cf. Chong et al., 2018; Hennig-Thurau et al., 2004; Jalilvand and Samiei, 2012). Analysing natural and unstructured narratives attracts users and keeps them loyal (e.g. Gretzel and Yoo, 2008; Park et al., 2007; Sánchez-Franco et al., 2018; Ye et al., 2011). In conclusion, adopting a customer-focused, data-driven paradigm can markedly augment a hotel's competitive advantage and financial performance. As society advances further into the digital era, leveraging the utility of UGC and sophisticated analytics will become increasingly instrumental for the prosperity of hotels.

Research objectives
Our study aims to employ data mining techniques to quantitatively assess the influence and relevance of key terms and semantic structures in hotel evaluations. This approach allows us to uncover potentially hidden, yet impactful, aspects of guest feedback, a significant step towards enhancing the guest experience and hotel offerings. As a prerequisite to elucidating our methodology, our research outlines the following structured research objectives:
(1) Data Preprocessing: Our aim involves the preprocessing of data, intending to sanitise and streamline the structure of our review corpus. This process provides a clean dataset for further exploration and analysis, allowing for more accurate interpretations.
Key-Term Extraction: A sub-aim lies in the retrieval, filtering and extraction of critical terms that characterise each narrative or published opinion. This extraction is carried out on both the frequency of occurrence of a term and the amount of information it represents, allowing an in-depth understanding of prevalent themes.
(2) Handling Extensive Document Archives: Our study uses pattern discovery and visualisation techniques based on machine learning methods, thereby reducing the complexity of the large data set. Our study also proposes to manage the extensive archive of document reviews through the automatic revelation of latent topics.
Contribution of Semantic Structures: Our primary sub-aim here is to apply data mining techniques to evaluate the contribution and significance of semantic structures in hotel appraisals, providing a more nuanced understanding of guest feedback.
Identifying Relevant Topics: The secondary sub-aim involves identifying the essential topics (by NMF topic modelling) associated with guests' experiences.
Topic Importance: Our tertiary sub-aim centres on the identification of the importance of each topic. This process is achieved by applying XGBoost, using SHAP values.
(3) Semantic Grouping: The present study seeks to facilitate and accelerate the interpretation of semantic structures by clustering topics using the K-Prototypes algorithm. This results in a more intuitive understanding of the topics, allowing faster insight and more informed decision-making.
(4) Proposing an Automated Review Evaluation System: Our final objective is to propose the development of an automatic review evaluation system. This system would predict the quality of reciprocally beneficial relationships with hotel establishments, which would reduce the time spent searching and sorting through the available documentation. This proposition could lead to significant improvements in how hotels manage and respond to guest feedback, enabling them to enhance the guest experience more efficiently.

Research method

Data collection
Andalusia is selected as a geographical, cultural, social, economic and political area of analysis because of its prestigious diversity in the tourism sector analysed, that is, the hotel sector. Andalusia offers a catalogue of hotel establishments based on sun and beach, business and congress and cultural, urban and rural areas. In particular, our research analyses structured and essentially unstructured data published on online content infomediation platforms (in our case, Booking.com) and focuses on the cities of Seville, Cordoba and Granada, three cities with significant points of cultural interest (museums, monuments and other assets of cultural interest). Seville, Cordoba and Granada account for 33% of tourists visiting the Andalusian region (Hotel Occupancy Survey, INE, September 2022).

Retrieving and extracting information from central recommender systems
An automated tracking system is run for data collection to record data related to the hotel, the review and the guest (reviewer) in a specified location. The procedure employs a Python-based web scraping approach to systematically collect and structure hotel review data across multiple locations from Booking.com and is summarised in the following steps. Initially, the list of study locations is defined (Step 1). For each location, the programme automatically extracts the names of hotels rated above three stars from Booking.com using request queries (Step 2). Subsequently, the software iterates through each hotel on Booking.com, gathering embedded structured data such as rating and date, along with unstructured review titles and texts via additional queries (Step 3). Reviews are stored as dictionaries containing the location, hotel name and other variables. The dictionaries are then used to construct Pandas data frames for analysis. The entire scrape-structure-store process is repeated for each location (Step 4), enabling the systematic, automated aggregation of multisite structured and unstructured Booking.com hotel review data. It is stressed that no individualised analyses are performed and the data collection is for purely academic purposes.
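The four steps above can be sketched as a nested loop that accumulates one dictionary per review. This is only an illustrative outline, not the collection code used in the study: `fetch_hotel_names` and `fetch_reviews` are hypothetical stand-ins for the HTTP queries and return toy data so that the example runs without network access.

```python
from typing import Dict, Iterable, List


def fetch_hotel_names(location: str) -> List[str]:
    """Hypothetical stand-in for the hotel-listing query (Step 2)."""
    return [f"Hotel A ({location})", f"Hotel B ({location})"]


def fetch_reviews(hotel: str) -> Iterable[Dict[str, str]]:
    """Hypothetical stand-in for the per-hotel review queries (Step 3)."""
    yield {"rating": "8.5", "date": "2022-08-15",
           "title": "Great stay", "text": "Close to the cathedral."}


def collect(locations: List[str]) -> List[Dict[str, str]]:
    """Repeat the scrape-structure-store cycle for every location (Steps 1 and 4)."""
    records = []
    for location in locations:
        for hotel in fetch_hotel_names(location):
            for review in fetch_reviews(hotel):
                # One dictionary per review, tagged with location and hotel name.
                records.append({"location": location, "hotel": hotel, **review})
    return records


records = collect(["Seville", "Cordoba", "Granada"])
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame(records)` to obtain the data frame described in the text.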
The proposed approach thus provides an efficient and scalable means of compiling large corpora of textual data for text analytics and modelling. The methodology is generalisable across review domains and websites. Retrieval and extraction aim to (1) retrieve and extract structured data associated with user reviews from hotel establishments, (2) retrieve and extract the set of features of reviews written in natural language to obtain a new set of non-redundant features and (3) produce structured data patterns that make data analysis feasible. The metadata include city, hotel, traveller type (group or friends, solo, couple or traveller with children, family) and length of stay, among others.
The data are downloaded between 1 and 3 September 2022. Our research then separates the reviews into positive and negative reviews (see Figure 1) and retains the English narratives with at least 80 characters, avoiding excessively concise reviews. Once normalised and cleaned (see the Data cleansing section), the final corpus comprises 23,545 reviews (14,317 positive and 9,228 negative) posted between September 2019 and August 2022.
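As a minimal sketch of this filtering step, the function below separates the positive and negative parts of each review and keeps only English narratives of at least 80 characters. The field names (`lang`, `positive_text`, `negative_text`) are assumptions made for illustration, and the language label is assumed to be available already; how the language was detected in the study is not stated.

```python
MIN_CHARS = 80  # minimum narrative length stated in the text


def split_reviews(rows):
    """Return (positive, negative) lists of English narratives with >= MIN_CHARS characters."""
    positive, negative = [], []
    for row in rows:
        if row.get("lang") != "en":  # hypothetical language field
            continue
        for field, bucket in (("positive_text", positive),
                              ("negative_text", negative)):
            text = (row.get(field) or "").strip()
            if len(text) >= MIN_CHARS:
                bucket.append(text)
    return positive, negative


# Toy rows, invented purely to exercise the filter.
rows = [
    {"lang": "en", "positive_text": "Lovely stay. " * 8, "negative_text": "Noisy."},
    {"lang": "es", "positive_text": "Muy bien. " * 10, "negative_text": None},
    {"lang": "en", "positive_text": None,
     "negative_text": "The air conditioning failed. " * 4},
]
positive, negative = split_reviews(rows)
```

Treating the two text fields independently means a single guest entry can contribute to both corpora, which matches the idea of analysing positive and negative narratives separately.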
Finally, our study assesses the quality of the online review. Several published studies have explored the influence of review quality on guest evaluations from different aspects, e.g. review length, review structure, readability and writing style. In particular, Forman et al. (2008) reveal that the readability of the review has a positive effect on its usefulness and spelling errors have a negative impact. The Flesch Reading Ease metric, calculated using the textstat 0.7.3 Python package, assigns readability scores on a scale from 1 to 100, with higher values denoting greater legibility. Scores ranging between 70 and 80 correspond to an eighth-grade reading level, indicating that such texts should be reasonably comprehensible for the average adult reader. The negative reviews published on Booking.com and analysed in our study reach a Flesch Reading Ease index equal to 75.22 (fairly easy to read). In contrast, the value of positive reviews drops to 70.26 (fairly easy to read). Moreover, there are no appreciable differences depending on the type of traveller (values around 72 points). There are also no significant differences according to the traveller's origin (values around 70-73 points) or the length of stay (values around 72 points).
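For reference, the standard Flesch Reading Ease formula is 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sketch below implements it with the standard library only; the syllable counter is a rough vowel-group heuristic assumed here for illustration, so its scores will not match those of textstat exactly.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula applied to naive sentence and word splits."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple_score = flesch_reading_ease("The cat sat on the mat.")
```

The raw formula is unbounded, so a very short, monosyllabic sentence such as the one above can score over 100; longer sentences and longer words pull the score down.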

Data cleansing
Figure 1. Example of positive and negative reviews on Booking.com

Since the representation of the reviews may correspond to a high-dimensional space, methods must be applied to clean and structure the input text and identify a simplified subset of the corpus features that can represent it in subsequent analysis. For this purpose, a normalisation process (cf. Cotelo et al., 2015) based on the automatic transformation of the documents is carried out to eliminate errors and expressions typical of the jargon used in the field of social networks (e.g. abbreviations, words with repeated letters or errors, textual emoticons, ASCII art, stop words, among others). In addition, the reduction of different word forms to a common base form (lemmatisation) is carried out, among other normalisation tasks.
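Some of the normalisation tasks listed in this section can be illustrated with a few regular expressions. The sketch below is an assumption-laden toy, not the procedure of Cotelo et al. (2015): the stop-word list, the abbreviation mapping and the emoticon pattern are tiny hypothetical examples, and lemmatisation is omitted because it needs a lexical resource.

```python
import re

# Tiny illustrative resources (hypothetical, far smaller than real lists).
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "in"}
ABBREVIATIONS = {"u": "you", "thx": "thanks", "gr8": "great"}
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")


def normalise(text: str) -> list:
    """Strip emoticons, lowercase, collapse letter runs, expand abbreviations, drop stop words."""
    text = EMOTICON.sub(" ", text)                    # textual emoticons such as :) or ;-P
    text = text.lower()
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)    # runs of 3+ identical letters become 2
    tokens = re.findall(r"[a-z0-9']+", text)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


tokens = normalise("The room was sooooo nice :) thx")
```

Collapsing runs to two letters rather than one keeps legitimate double letters ("room", "too") intact, at the cost of leaving forms such as "soo" for a later spelling-correction or lemmatisation step.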

Exploratory data analysis: results
Our research applies specific algorithms to extract underlying patterns from the data, thereby gaining knowledge and understanding of the described phenomenon (cf. Rygielski et al., 2002). This section focuses on applying algorithms for the identification and differentiation of the various semantic structures (posted by hotel guests) analysed from the processing of the content posted in the hotels and, consequently, their contribution to the levels of relational quality detected through sentiment analysis. This phase is subdivided into three subtasks:
(1) Subtask 1 focuses on visualising the most characteristic words in a category compared to others (using Scattertext 0.1.9, Kessler, 2017). It extracts a scored list of the most prominent sentences from a review by applying the PyTextRank 3.2.4 package (a modified version from Mihalcea and Tarau, 2004) and Phrasemachine (Scattertext 0.1.9) to identify noun phrases.
(2) Subtask 2 visualises pairwise comparisons between texts using word shift graphs, that is, a method for identifying which words contribute to the difference between the texts being compared (Shifterator 0.3.0, Gallagher et al., 2021).
(3) Subtask 3 focuses on modelling topics and their contribution to class prediction (relatively positive or relatively negative review).

Subtask 1: visualisation of the most characteristic words of a category in comparison to other categories
Initially, our research creates a scatter text graph that shows which words are associated with relatively positive (SAT) versus relatively negative categories (DISSAT) when guests describe their hotel stays. For this purpose, our study uses the Scattertext package (Kessler, 2017). Scattertext is a Python package used for generating interactive visualisations of text data, particularly for analysing and comparing the usage of words across different categories of text. Scattertext helps perform sentiment analysis, topic modelling and text classification tasks. Researchers can use it in different types of text data, e.g. customer reviews, news articles and social media posts.
In particular, Scattertext allows identifying words and phrases that are disproportionately frequent in one text category while also providing a way to compare the overall usage of words across different categories. Scattertext here identifies the most characteristic words in two texts based on the frequency with which each word appears in one text compared to the other, having eliminated adjectives and adverbs in our case. Figure 2 shows the visualisation of word usage between positive comments (SATISF) and negative comments (DISSAT) written by guests. Our study uses RankDifference() to determine the word scores used to create the scatterplot. Illustratively, the word "sight", located at the top left of Figure 2, has a score of 0.73127 with 373 mentions in SATISF (y-axis) and 18 in DISSAT (x-axis). Its graphical coordinates are (6, 103); 6 (for 25,000 terms) is the DISSAT coordinate, and 103 is the SATISF coordinate. Also, it shows 26 per 1,000 documents (SAT frequency) and 2 per 1,000 documents (DISSAT frequency).
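To make the scoring idea concrete, the sketch below computes a difference of scaled dense ranks between two categories. It reflects our understanding of the idea behind a rank-difference score and is an independent illustration, not Scattertext's source code. Only the two "sight" counts come from the text; the other counts are hypothetical, so the toy vocabulary will not reproduce the reported score of 0.73127.

```python
def dense_rank(values):
    """Dense ranks starting at 1: equal values share a rank, with no gaps."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


def rank_difference(counts_a, counts_b):
    """Scaled dense rank in category A minus scaled dense rank in category B."""
    ranks_a, ranks_b = dense_rank(counts_a), dense_rank(counts_b)
    max_a, max_b = max(ranks_a), max(ranks_b)
    return [ra / max_a - rb / max_b for ra, rb in zip(ranks_a, ranks_b)]


terms = ["sight", "smell", "bed"]
satisf_counts = [373, 20, 900]   # "sight" from the text; the rest hypothetical
dissat_counts = [18, 400, 850]   # "sight" from the text; the rest hypothetical
scores = dict(zip(terms, rank_difference(satisf_counts, dissat_counts)))
print(scores)
```

A word that ranks high in one category and low in the other gets a score far from zero, while a word that is common in both (such as "bed" here) scores near zero.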
The words on the x- or y-axes show high precision, i.e. high discriminative power regardless of their frequency. The closer a point (word) is to the top of Figure 2, the more frequently it is used in positive reviews (SATISF).
(1) In the upper left corner (Figure 2), words such as sight, heart, attraction, cordoba [Córdoba], distance or tapa, among others, are used frequently in positive reviews and rarely in negative reviews.
(2) Words used frequently in negative reviews and infrequently in positive reviews occupy the bottom right corner of Figure 2. These keywords include hear, smell, toilet, lack, corridor, wall, or phone.
(3) The most characteristic (common) terms of both sets of documents (stop words) tend to appear in the upper right corner (Figure 2). In this sense, the words commonly used in both types and with little differentiation are bed, breakfast, staff, pool, park, or bathroom, among others.
In summary, Figure 2 performs a thorough lexical analysis, distinguishing words most frequently featured in positive and negative hotel reviews. Terms such as sight, heart, attraction and distance emerge as salient features in positive reviews, emphasising the importance of the location of the hotel and the tourist attractions (cultural and gastronomic) of the city in the satisfaction of the guest. On the contrary, terms such as hear, smell, toilet and lack are linked to the hotel's primary service and comfort, highlighting noise, odours and sleeping comfort problems, i.e. comfort-related issues such as noise pollution and cleanliness. While tourist attractions may serve as potent marketing instruments, a lapse in essential comforts can substantially detract from a guest's overall experience. Furthermore, proportional changes are easy to interpret, but simplistic in extracting interesting differences between two texts (Gallagher et al., 2021). Scattertext allows the use of the singular value decomposition (SVD) technique with three factors, and our analysis proposes the relative positions of key terms after removing adjectives and adverbs. In Figure 3, our study represents the first two singular values (the result of SVD decomposition), locating each term on the x-axis (first singular value) and the y-axis (second singular value). SVD partly confirms the above results, which show words associated with the core services of a hotel, namely, the room and its services related to sleep and the bathroom, and representative of negative reviews (DISSAT). In contrast, terms about attractions and their distance from the hotel, views and amenities of the recommended areas of the destination (e.g. restaurants) are associated with positive reviews (SATISF).
In this regard, Figure 3 elevates the discussion through the SVD technique, validating and refining the insights gained from Figure 2. It becomes evident that shortcomings in core hotel services, for example, the room itself and its ancillary features, predominantly shape negative reviews. However, terms associated with location, scenic views and nearby attractions play a positive role. For management, our results require a bifurcated strategy: (a) enhancing basic comforts for a satisfactory stay and (b) promoting the hotel's locational advantages as unique selling propositions.
Furthermore, our study extracts specific phrases for each category using Phrasemachine (Scattertext 0.1.9), which allows a better contextualisation of the distinctive cues between classes. For example, in Figure 4, in the case of positive reviews, the expressions "excellent location" (0.91486), "friendly staff" (0.86630), "spacious room" (0.84879) or "perfect location" (0.84710) achieve the highest scores. On the other hand, the distinctive expressions of the negative reviews are the following: "didn't [did not] work" (−0.85441), "room door" (−0.82114), "room window" (−0.81220), "noisy night" (−0.77065) or "double bed" (−0.75652).
Therefore, Figure 4 uses Phrasemachine to delve into specific phrasal patterns indicative of guest sentiment. Phrases related to an excellent location and friendly staff emerge as leading indicators of positive guest experiences. In contrast, "didn't work" or "noisy night" are revealing of negative experiences. These findings imply that micro-interactions and amenities are not mere experiential details; they can shape the entire guest experience and therefore require judicious management.
The consolidation of findings from Figures 2-4 explains that both external and internal elements critically contribute to the architecture of guest reviews. While aspects such as scenic views and local attractions can be instrumental as marketing levers, basic characteristics relating to guest comfort are critical. Hospitality management would be well advised to formulate a balanced strategy that not only capitalises on the unique selling points concerning location but also addresses the fundamental mechanics of guest comfort and service.
5.2 Subtask 2: pairwise comparisons between texts using word shift graphs
Using Shifterator 0.3.0 (Gallagher et al., 2021), the estimated benchmark scores distinguish between different regimes of interest in word scores. In particular, the Shifterator makes it possible to identify which words explain the most variation between texts (reference and comparison categories) and visualise pairwise comparisons using word shifts. Furthermore, it allows us to know the score of each word in terms of its use in each text and, likewise, to know qualitatively whether the word is relatively positive or negative. In summary, our study quantifies which words contribute to the differences between two texts and how they do so. By making these lexical shifts transparent and quantifiable, word shift graphs enable more grounded statistical analyses and enhance our understanding of how language varies across textual contexts.
In particular, our study constructs several word shift graphs with horizontal bar charts that provide word-level explanations of how and why two texts in each category differ. Our study previously constructed a dictionary of words in which each word is assigned a weight or score using weighted logarithmic odds values (above the 60th percentile), identifying 7,216 keywords with their scores from most dissatisfied (negative values) to most satisfied (positive values). It also avoids overlapping terms between polarities (positive and negative) according to their context of use. The weighted log-odds method is described in Monroe et al. (2008). It is a relevant approach for text analysis in that it accurately measures how word usage differs (and scores) in a comparative set of documents, in our case, "relatively positive/satisfactory" or "relatively negative/unsatisfactory" reviews.
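The weighted log-odds scoring can be illustrated with the following minimal sketch, which z-scores the log-odds ratio of each word under an informative Dirichlet prior, following our reading of Monroe et al. (2008). The word counts are hypothetical, and using the pooled counts as the prior is an assumption made for the example; the sketch does not apply the 60th-percentile cut-off mentioned above.

```python
import math


def weighted_log_odds(counts_i, counts_j, prior):
    """z-scored log-odds ratio with an informative Dirichlet prior."""
    n_i, n_j = sum(counts_i.values()), sum(counts_j.values())
    a0 = sum(prior.values())
    z = {}
    for word, a_w in prior.items():
        y_i, y_j = counts_i.get(word, 0), counts_j.get(word, 0)
        log_odds_i = math.log((y_i + a_w) / (n_i + a0 - y_i - a_w))
        log_odds_j = math.log((y_j + a_w) / (n_j + a0 - y_j - a_w))
        variance = 1.0 / (y_i + a_w) + 1.0 / (y_j + a_w)
        z[word] = (log_odds_i - log_odds_j) / math.sqrt(variance)
    return z


# Hypothetical word counts in satisfied and dissatisfied reviews.
satisf = {"location": 30, "noise": 5, "room": 20}
dissat = {"location": 5, "noise": 30, "room": 20}
# Assumption: the prior is the pooled count of each word across both sets.
prior = {w: satisf[w] + dissat[w] for w in satisf}

z_scores = weighted_log_odds(satisf, dissat, prior)
print(z_scores)
```

Positive values mark words leaning towards the first set (here, satisfied reviews) and negative values mark words leaning towards the second; a word used equally in both, such as "room" in this toy data, scores zero.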
A brief interpretative guide to the word shift graphs (Gallagher et al., 2021), showing the top fifty words contributing to the difference in satisfaction versus dissatisfaction between the categories compared, is set out below:
(1) A relatively positive word (+) is used more frequently (↑) in the second text (COMP) (less in the first text, REF).
(2) A relatively positive word (+) is used less frequently (↓) in the second text (more in the first text).
(3) A relatively negative word (−) is used more frequently (↑) in the second text (less in the first text).
(4) A relatively negative word (−) is used less frequently (↓) in the second text (more in the first text).
(5) If the contribution of a word is positive, δτ > 0 (that is, +↑ or −↓), the bar points to the right, and if negative, δτ < 0 (i.e. +↓ or −↑), the bar points to the left.
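The four contribution types listed above can be sketched as follows. The example uses the weighted-average form of a word shift as we understand it from Gallagher et al. (2021), in which a word's contribution is its score relative to the reference value multiplied by its change in relative frequency. It does not call Shifterator, it omits normalisation, and the scores and frequencies are hypothetical.

```python
def shift_contribution(score, p_ref, p_comp, reference_value=0.0):
    """Contribution of one word: (score - reference) * (p_comp - p_ref)."""
    return (score - reference_value) * (p_comp - p_ref)


def contribution_type(score, p_ref, p_comp, reference_value=0.0):
    """Polarity of the word (+/-) and its frequency change in COMP versus REF."""
    polarity = "+" if score > reference_value else "-"
    direction = "↑" if p_comp > p_ref else "↓"
    return polarity + direction


# Hypothetical words: (dictionary score, relative frequency in REF, in COMP).
words = {
    "location": (3.0, 0.020, 0.030),
    "view": (2.0, 0.015, 0.005),
    "smell": (-2.5, 0.004, 0.010),
    "noise": (-3.0, 0.012, 0.006),
}

for word, (score, p_ref, p_comp) in words.items():
    delta = shift_contribution(score, p_ref, p_comp)
    side = "right" if delta > 0 else "left"
    print(word, contribution_type(score, p_ref, p_comp), round(delta, 4), side)
```

With a reference value of 0, a positive word used more often and a negative word used less often both push the comparison text towards satisfaction (bar to the right); the other two combinations push it towards dissatisfaction (bar to the left).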
In summary, four different types of contribution (+↑, +↓, −↑, −↓) are indicated by bars. A relatively positive and more frequent word (compared to the reference text) is characterised by a bright yellow bar on the right (+↑), while a relatively negative and more frequent word (compared to the reference text) is characterised by a bright blue bar on the left (−↑). For our word shift graphs, our study sets a reference value of 0 (the centre of our dictionary scale) and applies a stop lens to dictionary words between −5 and 5. The graphs feature diagnostic plots of cumulative contribution and text size in the lower left and right corners, respectively. Specifically, the point at which the cumulative curve intersects the horizontal line signifies the proportion of the word shift difference accounted for by the most contributing terms (see Gallagher et al., 2021). Therefore, it is essential to consult it to determine the weight to be given to the interpretation based on the word shift graph. The second diagram shows the relative size of the text in each corpus, measured by the number of tokens (here, words) used.
5.2.1 Analysis by type of traveller. In the following, our study discusses the main results achieved by the type of traveller (Figure 5). The couple category is used as the reference category (REF) because it has the highest number of reviews. Valences are based on weighted logarithmic odds values, i.e. the strength of the link between words and the valence of reviews.
(1) The analysis of guest reviews reveals that those travelling as a family tend to have more negative experiences than those travelling as a couple due to the higher use of relatively negatively loaded words in the narratives of reviews (from families), such as room, floor, time, water, pay, or smell. In contrast, the couple category employs more positive words, for example, location, walk, view, terrace, or city. On the other hand, guests travelling as a family tend to use relatively positive words that partially offset the negativity noted earlier, e.g. staff, breakfast, or shop, associated with the hotel's quality of service and amenities.
The relative total of each type of contribution (positive or negative) is shown at the top of the word shifts graph, which allows for a clear comparison between the different sentiments expressed by guests travelling as a family and as a couple.
(2) The analysis of guest reviews reveals that those travelling in groups have more favourable experiences than guests travelling as a couple due to the increased use of positive words, such as staff, location, amaze, or rooftop, associated with the quality of service provided. In addition, guests travelling in groups tend to use fewer negative words (e.g. bite, room, night, hear, or window) and more negative words (e.g. water, book, air-conditioning, pay, or tell) related to room amenities or service issues.
(3) Guests travelling alone report more negative experiences than those travelling as a couple. This difference in sentiment is likely due to the greater use of relatively negatively loaded words in review narratives, e.g. room, hear, noise, people, or window. Likewise, the lower use of positive words, such as location, breakfast, staff, view, or city, is also worth noting. It could suggest that solo travellers highlight problems with the comfort of their accommodation and nearby amenities.
Finally, looking at the point where the cumulative curve intersects the horizontal cut-off line (see cumulative contribution diagram in the bottom left corner), the first ten words explain around 50% of the difference between the two texts.
5.2.2 Analysis by length of stay. Our study also discusses the main results achieved by length of stay (Figure 6). The reference category (REF) is a guest whose stay is equal to one night.
Guests staying two nights or more tend to use more negative words, such as water, floor, window, or door, suggesting complaints about the room's amenities or service. One possible explanation may be that guests who stay longer tend to be more critical of their hotel stay, foster higher expectations, and, as a result, likely question issues related to the amenities and services the hotel provides associated with water, floor, window or door.
Hotels should, therefore, strive to balance addressing negative issues and emphasising positive aspects to ensure that guests staying for longer periods have a pleasant stay, providing excellent service, being attentive to guests' needs and offering a variety of amenities. Additionally, hotels should consider that guests staying longer may have higher expectations and adjust their services accordingly.
The relative total of each type of contribution is shown at the top of the word shift graph, allowing a clear comparison between the different sentiments expressed by guests staying for different periods. In general, hotels should strive to balance addressing negative issues and highlighting positive aspects to ensure that guests staying longer periods have a pleasant stay. Finally, looking at the point where the cumulative curve intersects the horizontal cut-off line, the first ten words explain around 50% of the difference between the texts.
According to previous results, our methodology allows for the quantification of the cumulative contribution of words, offering a valuable parameter for interpretation. It is useful in customer experience management, allowing the development of targeted hospitality strategies and specialised in-stay experiences. Additionally, it helps to allocate focus and resources to areas that need improvement, as identified by significant terms in negative reviews.
In particular, sentiment allocation by traveller is a notable feature of our method. Figure 5 provides additional information on the experiences of varying types of travellers and lengths of stay. Therefore, our method serves as a diagnostic tool that highlights problem areas and also suggests targeted solutions for hotel management, enriching the base for tactical and strategic decision making. In fact, our observed trends in guest reviews in different travel categories (families, couples, groups and solo travellers) reveal intricate dynamics of expectations, experiences and expressed sentiments. Below are key interpretations for each type (COMP) compared to couples (REF):
(1) Families: The greater prevalence of negatively loaded words in family reviews might suggest that families are more critical or have higher expectations regarding room quality and stay (e.g. room, floor, time, water, pay, or smell, among others).
Dissatisfaction could be rooted in the challenges associated with accommodating multiple people with varying needs, which could lead to increased focus on shortcomings. However, the use of positive words related to service quality (e.g. staff, breakfast, or shop, among others) partially offsets this negativity, indicating that family-friendly services could mitigate some of the perceived shortcomings.
(2) Groups: Interestingly, the groups display favourable experiences. It suggests a social amplification of positive sentiment. Groups focus less on negative issues and find social interaction to compensate for any service-related or amenity-based shortcomings. The groups seem to have general criticism related to room amenities or specific service issues, showing aligned expectations as a collective.
(3) Solo travellers: Negative experiences among solo travellers could be interpreted in various ways. Focussing on noise, windows, or floor, among others, could suggest an emphasis on privacy and personal space, which could be lacking. Solo travellers may feel less distracted by social interactions, leading to greater awareness of any shortcomings in their accommodation.
In conclusion, our research underscores the need for specific segmented customer experience management strategies. They offer valuable quantitative and qualitative insights that can greatly inform hotel management decisions. By adapting services and preemptively addressing customer needs based on these insights, significant improvements in satisfaction scores and fiscal viability can be expected.

Subtask 3: modelling of topics
Visualisation-based analyses provide a meaningful and interpretable summary of how individual terms contribute to cross-text variation and are helpful in knowledge extraction. However, identifying the rational and experiential (latent) topics in the corpus, and their visualisations, is also a fundamental approach to understanding the proper context of guests' opinions.
In this regard, topic modelling is a valuable tool in natural language processing and text mining, which groups similar words to identify patterns in a collection of text documents and latent topics in the corpus, even when they are not explicitly mentioned. The generated topics are interpretable, and the terms associated with each topic highlight the meaning of each. Therefore, the results could be used for text classification, document summarisation, or building recommendation systems.
Next, our research estimates the relationships between terms and documents through a text-mining algorithm to discover hidden semantic structures (topics) in our dataset.
In particular, our research analyses natural and unstructured narratives using machine learning algorithms based on text summarisation and the application of NMF (cf. Lee and Seung, 1999). Our study applies topic analysis to normalised reviews using the NMF method implemented in sklearn (scikit-learn developers, 2020). NMF factors high-dimensional vectors (in our case, the TF-IDF matrix of M documents and N terms or words) into a lower-dimensional representation. The lower-dimensional vectors are non-negative, and their coefficients are also non-negative. In essence, from a matrix of documents (reviews) by words (A), the NMF application generates two matrices, i.e. (1) the matrix W (topics × terms) and (2) the coefficient matrix H (documents × topics). Our analysis excludes terms that appear in fewer than 100 documents (min_df) or in more than 95% (max_df) of the documents. NMF shows higher coherence levels than Latent Dirichlet allocation (LDA). Coherence levels measure the similarity of meaning between the critical terms in topic i. Coherence helps to differentiate between topics that are interpretable and topics that are merely artefacts resulting from the applied statistical technique. Using the coherence score c_v (more accurate than u_mass), our study runs the model for different numbers of topics (from 10 to 100 topics) and selects the number of topics with the highest coherence score. Following this procedure, the recommended number of topics is equal to 25, a number from which coherence starts to decrease. Table 1 provides a word-by-word description of each topic and an illustrative example of a narrative for each topic.
Moreover, selecting a relatively small set of high-probability words per topic is advisable, since reviews tend to focus on key aspects of a particular stay. Rather than extracting extensive vocabulary, limiting to approximately 10 words per theme retains interpretability whilst minimising extraneous content. Although dependent on the dataset, this constrained lexicon likely encapsulates the essence of each topic without redundancy. In Table 1, topics and associated terms should be manually examined to validate consistency and relevance.
Similarly, our study applies an XGBoost classification algorithm, trained with the H-matrix of documents per topic and previously transformed by the natural logarithm of the values. Eighty per cent of the documents allow us to train the model. The remaining 20% enable us to validate the model's performance and ensure that the model reliably predicts future observations. Our analysis uses StratifiedKFold and GridSearchCV from scikit-learn to select the best parameters for the XGBoost function. StratifiedKFold is a variation of KFold that returns stratified folds which preserve the percentage of samples for each class, here with five folds. GridSearchCV allows testing a range of parameters. Once both functions are applied, the best hyperparameters are: {'colsample_bytree': 0.4, 'gamma': 0.5, 'max_depth': 8, 'min_child_weight': 5, 'subsample': 0.8}.
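The tuning procedure described above can be sketched as follows. To keep the example dependent on scikit-learn only, GradientBoostingClassifier is swapped in for XGBoost, the features are synthetic stand-ins for the document-topic matrix, and the parameter grid and accuracy scoring are hypothetical choices; the logarithmic transformation is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in: 300 "reviews" described by 25 topic weights, binary polarity.
X, y = make_classification(n_samples=300, n_features=25, n_informative=8,
                           random_state=0)

# 80% of the documents for training, 20% held out, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Five stratified folds, as in the text.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"max_depth": [2, 4], "subsample": [0.8, 1.0]}  # hypothetical grid

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",  # assumption: the text does not state the scoring metric
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

Every parameter combination is evaluated on the same five folds, and the hold-out set is touched only once, after the best combination has been refitted on the full training portion.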
Finally, our analysis further estimates the confusion matrix (Table 2) and the Matthews correlation coefficient (MCC) to validate the research model. The MCC is a statistical measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. In our analysis, the MCC value equals 0.70.
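For reference, the MCC can be computed directly from the four cells of a binary confusion matrix, as in the minimal sketch below. The counts are hypothetical and are not those of Table 2.

```python
import math


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical confusion matrix (not Table 2).
mcc = matthews_corrcoef(tp=30, fp=10, fn=10, tn=50)
print(round(mcc, 4))
```

The coefficient ranges from −1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it uses all four cells, which is why it remains informative when one class is much larger than the other.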
Second, our study identifies key characteristics related to hotel stays using the SHAP values for each topic for each document (SHapley Additive exPlanations, hereafter SHAP; see Lundberg and Lee, 2017). The SHAP values describe how each topic in our model contributes to increasing (or decreasing) positive or negative levels. Here, SHAP values systematically uncover the crucial elements that govern guest satisfaction during hotel stays. Consequently, to clarify the hierarchy of influential variables, Figure 7 has been designed to reveal the factors that contribute positively or negatively to the overall guest experience.
In particular, Figure 7b allows us to visualise the importance of characteristics and their impact on the prediction by plotting summary charts. That is:
(1) The Y-axis indicates the feature names in importance order from top to bottom.
(2) The X-axis represents the SHAP value, which indicates the degree of change in model outputs.
(3) The colour of each point on the graph represents the value of the corresponding feature, with red indicating high values and blue indicating low values.
(4) Each point represents a row of data from the original data set.
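The quantity plotted on the X-axis can be illustrated with a brute-force Shapley computation on a toy model. This sketch averages each feature's marginal contribution over all feature orderings; it is not the algorithm used by the SHAP library, which relies on faster model-specific methods, and the topic effects below are hypothetical numbers chosen for the example.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            totals[f] += value_fn(frozenset(present)) - before
    return {f: total / len(orderings) for f, total in totals.items()}


# Hypothetical additive "model": each topic present shifts the predicted score.
effects = {"topic_0": 0.30, "topic_6": -0.20, "topic_18": -0.10}


def toy_model(present):
    """Predicted satisfaction given the set of topics present in a review."""
    return 0.5 + sum(effects[f] for f in present)


phi = shapley_values(toy_model, list(effects))
print(phi)
```

The values sum to the difference between the prediction with all topics present and the baseline prediction, so a positive value reads as a push towards satisfaction and a negative value as a push away from it.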
For example, Topic 0 emerges as the most vital, covering a range of factors including the quality of staff interactions and aesthetic appeal of the facilities. Topic 0 increases the predicted output, that is, guest satisfaction. On the contrary, variables such as Topic 6 and Topic 18, which deal with noise and comfort issues, as well as effectiveness of problem resolution, are found to harm guest satisfaction.
In this regard, the topics with the highest (to lowest) degree of importance are the following:
(1) Topic 0 (staff and facility aesthetics) (+) is related to elements of a hotel experience, including staff, language, furnishings, decoration, amenities, temporary stays, membership, ownership, location and design.
(2) Topic 6 (noise and comfort concerns) (−) highlights the various equipment, services and characteristics provided in a hotel room to make the stay more comfortable and convenient for guests.
(3) Topic 18 (guest concerns and resolutions) (−) is related to problems the guest may experience during their stay, e.g. maintenance work being done at the hotel.
(4) Topic 1 (walking distance to attractions) (+) focuses on walking and exploring a specific location, such as an attraction or sightseeing spot, as well as nearby places such as plazas or shops.
(5) Topic 7 (centrally located to explore city) (+) concentrates on exploring a city and its central area, such as the city centre or heart and how to reach it by walking or taking, for example, a taxi.
(6) Topic 17 (charming old town and location) (+) focuses on the city and its characteristics, accessibility and visits.
(7) Topic 24 (aesthetically pleasing design) (+) is related to the design and ambience of a hotel, specifically the outdoor spaces and the property's aesthetic appeal.
(8) Topic 12 (overall guest experience and attributes) (+) highlights luxury and exclusivity and the experiences while staying in it.
Finally, to better understand guests in terms of topics and structural features, our analysis applies the K-Prototype approach to classifying similar guests into the same group. The variables used in the clustering for mixed data (numerical and categorical) are those shown in the table accompanying Table 3. In addition to the length of stay, type of traveller, or type of review, our study also includes the eight topics with the highest contribution to prediction (see Figure 7a); the scores are normalised. Our analysis also balances and intersects the different classes of the categorical variables to avoid over-dimensioning the classes. Eight clusters are proposed using the elbow method. The elbow method calculates the total variance of the clusters (cost) for a number from 2 to n. As the number of groups increases, the total variance of the groups should decrease. The elbow method proposes that the number of groups in which additional groups do not produce a significant decrease in total variance is the number of groups to extract (here, 8 groups or clusters).
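The cost that the elbow method tracks can be illustrated with the usual K-Prototypes dissimilarity for mixed data: squared Euclidean distance on the numerical variables plus a weight (gamma) times the number of mismatching categorical values. The sketch below shows only the assignment and cost steps for fixed prototypes; the guests, prototypes and gamma value are hypothetical, and the prototype update loop is omitted.

```python
def kprototypes_distance(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma * categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    mismatches = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric + gamma * mismatches


def assign_and_cost(points, prototypes, gamma=1.0):
    """Assign each point to its nearest prototype and return labels and total cost."""
    labels, cost = [], 0.0
    for num, cat in points:
        distances = [
            kprototypes_distance(num, p_num, cat, p_cat, gamma)
            for p_num, p_cat in prototypes
        ]
        best = min(range(len(prototypes)), key=distances.__getitem__)
        labels.append(best)
        cost += distances[best]
    return labels, cost


# Hypothetical guests: (normalised topic scores, (traveller type, review type)).
guests = [
    ([0.9, 0.1], ("couple", "positive")),
    ([0.8, 0.2], ("couple", "positive")),
    ([0.1, 0.9], ("family", "negative")),
    ([0.2, 0.7], ("solo", "negative")),
]
prototypes = [
    ([0.85, 0.15], ("couple", "positive")),
    ([0.15, 0.80], ("family", "negative")),
]

labels, cost = assign_and_cost(guests, prototypes)
print(labels, round(cost, 4))
```

Repeating this for an increasing number of prototypes and plotting the total cost gives the curve on which the elbow is read.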
The UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction technique is used to represent the data in 2 dimensions (Figure 8). Three steps are taken to obtain the embeddings: (1) the Yeo-Johnson transformation is applied for numerical variables and one-hot encoding for categorical variables; (2) UMAP is applied separately to each type of variable and (3) the obtained embeddings are combined. In particular, Figure 8 [...] space. Clustering phenomena within the map often point towards inherent groupings or classes present in the data, while point density offers insights into the prevalence of certain characteristics (not likely to be a spurious group). The topological aspects of UMAP maps thus provide additional layers of data interpretation, particularly in understanding gradual transitions between data clusters or groups.
In our research, although K-Prototypes clustering provides distinctive clusters, the clusters or segments are distributed with clear boundaries and specific clusters appear in different areas of the scatterplot. This is probably a consequence of suboptimal embeddings.
In summary, the analytical techniques adopted in our study provide a comprehensive yet nuanced understanding of the guest experience within hotels. SHAP identifies the pivotal factors that contribute to both positive and negative aspects of customer experience, thus guiding managers in crafting more targeted service strategies. Simultaneously, UMAP and K-Prototype facilitate the creation and visualisation of guest clusters, enabling service customisation at a deeper level. Our collective insights are indispensable for hotel managers seeking excellence in service, ultimately contributing to increased profitability.
In addition, our analysis calculates the mean value of each topic per cluster to check whether the mean allows one to evaluate the topics' significance in each cluster. Then the variance of the means between the clusters is calculated for each topic. This allows for the selection of the main topics per cluster. Figure 9 shows the differences per group. Values are scaled between 0 and 1 for ease of visualisation. The extracted clusters provide information on the various aspects of hotel stays that are most important to guests and can help hotel management improve guest satisfaction. That is:
(1) The first group comprises guests travelling alone and staying between 2 and 3 nights. The critical topic associated with this group is topic 6, which is related to the [...]
(2) The second group consists of guests travelling as a couple and staying between 2 and 3 nights. The narratives in this cluster tend to be positive, with the most relevant topic being topic 1, which relates to the location and surroundings of the hotel, specifically the hotel's proximity to various tourist attractions, places of interest and shopping areas. To a lesser extent, topic 7, related to the location and accessibility of the hotel in a city or urban area, is also relevant. Additionally, topic 0, on the hotel facilities and services, is also discussed.
To sum up, the second group comprises the various features of hotel amenities, services and characteristics, including the hotel's proximity to points of interest, accessibility, ease of reaching the hotel by foot or public transportation, staff fluency in multiple languages, front desk service, decor and facilities, as well as the overall guest experience during their stay.
(3) The third group of guest experiences comprises guests travelling in groups and staying for one night. The overall sentiment expressed is positive, and the most relevant topic discussed is topic 0 related to the hotel's administration. Topic 0 covers various aspects of the hotel, such as the staff, their level of fluency in different languages, the front desk, the decor, the facilities, the visit, the members, the property, the site and the decoration.
(4) The fourth group is composed of guests travelling as a family and guests booking an overnight stay. The narratives in this group are primarily negative, describing issues related to topic 18. Topic 18 refers to problems that guests may experience during their stay, e.g. maintenance work at the hotel. In addition, topic 6 relates to the various equipment, services and features provided in rooms to make the stay more comfortable and convenient. Topic 6 is discussed to a lesser extent.
In particular, guest reviews indicate various issues related to the physical characteristics of the hotel room, such as the discomfort associated with the bed, lack of adequate air conditioning, outside noise or that of other guests, and issues with flooring, walls and windows. These factors can negatively impact the overall experience and sleep comfort. Although these physical characteristics of the hotel have previously been highlighted for stays of more than one night (group 1), in this case, they are also relevant for guests travelling as a family (with children).
(5) The fifth group comprises solo guests who comment on positive experiences during a one-night stay. The most relevant topic discussed is topic 7. Topic 7 comprises issues such as the hotel's proximity to the city centre, a location that allows exploring the heart of the city easily, the ease of reaching the hotel on foot or by public transportation, or the short distance to the hotel by taxi. In summary, the fifth group values the ease of reaching the city's main attractions and enjoying the city centre without transport problems.
(6) The sixth group consists of guests travelling as a family for 2 or 3 nights. The most relevant topic discussed is topic 17 associated with the location, characteristics and accessibility of a hotel in a town; for instance, the hotel's proximity to the city centre, its ability to locate and visit charming sites, the hotel's architectural design, the courtyard, the charm of the visit, the central location, the ease of access and the proximity to any points of interest in the town, such as a bridge or a cathedral.
Therefore, the sixth group values exploring attractive city sites and having a comfortable stay with easy access to amenities and points of interest in the area.
(7) The seventh group is identified as guests travelling as a family for one night and reporting mainly negative aspects. However, no specific topic stands out in their reviews.
(8) The eighth group is made up of families staying overnight. They narrate positive aspects that revolve around topic 12, which relates to the luxury and exclusivity of a hotel and the guest's experience while staying in it. This topic encompasses various aspects, such as being amazed by the hotel experience, the rooftop (an outdoor area on the top of a building, often used as a recreational space or for events), the design, the property, the staff, the construction, the stunningness, the suite and the visit. For example, a family orders suite rooms to reduce lodging costs. Topic 0 is also discussed.
In summary, hotel managers must pay attention to the aesthetic and functional elements, the physical characteristics and maintenance of the establishment, the level of service provided by the personnel, the quality of the building and its infrastructure, the level of surprise and satisfaction generated, the unique rooms and amenities offered, and the overall guest experience. Therefore, this group of guests appears to value the luxury and exclusivity of the hotel, specifically the hotel design and rooftop experience, which left them amazed. They also praise the hotel staff and the hotel maintenance level. The hotel suite and the visit also stood out as positive aspects.
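The assignment of a most relevant topic to each guest group described above can be illustrated with a minimal sketch. It assumes that each review already carries a group label and a set of topic weights from a fitted topic model; the function name, the toy weights and the group labels are hypothetical and do not reproduce the study's data or its exact procedure.

```python
from collections import defaultdict


def dominant_topic_by_group(reviews):
    """Return {group: (topic_id, mean_weight)} for the topic with the
    highest mean weight in each group.

    `reviews` is an iterable of (group_label, {topic_id: weight}) pairs.
    A topic missing from a review is treated as having weight 0 there.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for group, topic_weights in reviews:
        counts[group] += 1
        for topic, weight in topic_weights.items():
            sums[group][topic] += weight

    result = {}
    for group, topic_sums in sums.items():
        topic, total = max(topic_sums.items(), key=lambda kv: kv[1])
        result[group] = (topic, total / counts[group])
    return result


# Hypothetical toy data: (guest group, topic weights of one review).
reviews = [
    ("solo, 1 night", {7: 0.5, 0: 0.5}),
    ("solo, 1 night", {7: 0.75, 0: 0.25}),
    ("family, 2-3 nights", {17: 0.75, 7: 0.25}),
    ("family, 2-3 nights", {17: 0.5, 12: 0.5}),
]

top = dominant_topic_by_group(reviews)
for group, (topic, weight) in top.items():
    print(f"{group}: topic {topic} (mean weight {weight:.3f})")
```

Averaging topic weights per group is only one possible aggregation; counting the reviews in which a topic is dominant would be an equally plausible choice and could rank topics differently.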

Conclusions
Our study underscores the paramount importance of considering a holistic and customer-centric approach to improve guest satisfaction in the hotel industry. Drawing on commitment theory, our research identifies relational quality, location and accessibility, service differentiation and standardisation as key facets that significantly impact guest satisfaction. This understanding offers valuable information for hotel managers to craft services that resonate with guest expectations and foster trust and commitment.
Our research presents a comprehensive method for analysing guest reviews using advanced data mining and machine learning techniques. It (1) aims to identify key characteristics and themes that guests highlight during their hotel stays, (2) visually explores the relationships between these characteristics and differences between different types of travellers through online hotel reviews and (3) determines predictive power. Its implications are crucial for the hospitality domain, as they provide real-time insights into guests' perceptions and business performance and are essential for making informed decisions and staying competitive. Furthermore, as stated in the Andalusian Regional Government Strategic Tourism Marketing Plan (2020), obtaining this information promptly is critical to the success of hotel establishments.
However, with the abundance of online reviews available, it can be overwhelming for managers to analyse them effectively. Our method thus helps to organise and distil this massive amount of data, making it easier to understand. Furthermore, using data mining, machine learning techniques and visual approaches, researchers and managers can extract valuable insights (on guests' preferences) and convert them into strategic thinking based on exploration and predictive analysis. In summary, by providing practical implications for guest perceptions, our study suggests that different types of guests differ in the hotel factors they consider key. Consequently, it aims to help hotel managers make informed decisions, thus improving the overall guest experience and increasing competitiveness.
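The predictive power referred to above is typically summarised with a confusion matrix. The following minimal sketch assumes the binary coding given in the table notes (0: a positive review; 1: a negative review); the labels and predictions are hypothetical toy values, not the study's results, and the helper names are illustrative.

```python
def confusion_matrix(y_true, y_pred):
    """Build a 2x2 matrix m[t][p]: rows are true labels, columns predictions."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def metrics(m, target=1):
    """Return (accuracy, precision, recall) with `target` as the class of interest."""
    other = 1 - target
    tp = m[target][target]   # target reviews correctly predicted
    fp = m[other][target]    # other-class reviews predicted as target
    fn = m[target][other]    # target reviews missed by the classifier
    total = sum(sum(row) for row in m)
    accuracy = (m[0][0] + m[1][1]) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 0 = positive review, 1 = negative review.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

m = confusion_matrix(y_true, y_pred)
accuracy, precision, recall = metrics(m, target=1)
print(m, accuracy, precision, recall)
```

Reporting precision and recall for the negative class alongside accuracy matters here because online reviews tend to be mostly positive, so accuracy alone can look high even when few negative reviews are detected.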

Implications
Our work pursues an exploratory purpose (Rigdon et al., 2017). It designs computational experiments and analyses complex and uncertain systems. As a result, it helps (1) to gain and extend knowledge and understanding of the phenomenon described and (2) to better pinpoint the problem to be investigated, that is, the organisation of the information provided by the data, detection of patterns of behaviour and determination of topics (or semantic structures) and their relationships with the phenomenon in question. Therefore, the proposed research encourages further theory-building by applying an inductive reasoning perspective (Henseler, 2018).
Based on the designed method, our study develops a prototype proposal in the tourism domain that visualises the advantages of analysing UGC. In particular, topic modelling, prediction, classification and the visualisation of natural language processing results are employed in our research to facilitate the preprocessing of guest online reviews and the subsequent analysis of data from various academic disciplines and research areas. Therefore, our key implication lies in proposing an approach (a) that can find and model latent knowledge that researchers find difficult to observe by exploring large volumes of data and (b) that helps in the decision-making process in the hotel sector. In this context, integrating multiple data sources (structured and unstructured) based on guest feedback, and their subsequent compression and value provision, suggests ways of working that are not possible with traditional single-discipline approaches. Moreover, as Rong et al. (2012) point out, data mining emerges precisely as a valuable method to help business managers achieve the goals of efficient user relationship management.
In short, the field of information and communication technologies and tourism generates growing academic, technical and business opportunities and challenges when approached together. For example, the application of data mining to the tourism sector, particularly to hotel establishments, increases the chance of meeting demand with less information asymmetry. Furthermore, increased hospitality competitiveness creates satisfaction and loyalty-specific objectives to improve international positioning and provide stability to the destination (tourists who repeat and recommend their stay; see Campón-Cerro et al., 2015a, b; Hernández Mogollón et al., 2013). Therefore, the level of relational quality in their accommodation in Andalusia is a determining factor in creating, maintaining and intensifying the expected loyalty. Nevertheless, although published assessments and perceptions of quality and excellence are based on their experience (during their stay), it is necessary to go deeper into the printed text and analyse its latent structure from a demand perspective to ensure a satisfactory and emotionally rewarding visit.

Consequently, our research contributes to the literature by revealing the vital features of hospitality experiences and hotels as perceived by guests and how to improve guest experiences. From the logic of sufficiency, this question is examined in depth using a vast amount of natural and unstructured UGC to provide insights that may not be obtained through conventional methods. Therefore, our research proposes an integration framework for the content provided by guest comments concerning accommodation to extract which hospitality characteristics are equally or differentially relevant. Our study contributes to the literature on hospitality services, offering insight for practice and allowing the design of guest (dis)satisfaction policies and a more fine-grained understanding of hospitality services. Furthermore, the results reported in this investigation contribute to the debate about whether hotels and other alternative tourist accommodations compete concerning the main accommodation conditions.
The main implications of our study are thus as follows. First, hotel managers should improve the services (improving rooms and rest). In particular, the hotel should take preventive measures against uncomfortable issues by reducing noise in rooms and corridors and increasing air circulation to reduce odours. Second, hotel maintenance should be a priority, ensuring that the hotel is clean, tidy and in good condition. Likewise, hotel rooms' equipment, services and features are crucial for guest comfort and convenience and to avoid guest complaints. Third, guests highlight the hotel's location and the surrounding interests, its proximity to tourist attractions, places of interest and shopping areas. Therefore, the hotel should highlight gastronomic and cultural offers close to the hotel. Fourth, the design and ambience of the hotel, hedonism and exclusivity, e.g. outdoor spaces and aesthetic facilities, are crucial to guests. In particular, our findings also suggest that hotels should (1) address (negative) issues related to family rooms, such as room size, odours and water quality, and (2) highlight the location and nearby amenities (neighbourhood) for guests travelling alone. Similarly, managers should adjust promotional efforts to meet guests' needs, for example, by providing excellent services by friendly, helpful and committed staff (responsible for taking reservations, cleaning rooms, planning parties and maintaining the building) or amenities to families (e.g. a good breakfast and stores) and compensating for negative experiences. Fifth, managers must pay attention to the features and facilities related to the comfort of rooms for guests travelling alone (e.g. room size, noise levels and window quality, that is, factors affecting sleep). Finally, guests travelling in groups (compared to couples) use positive words (in their narratives) such as staff, location, amaze or rooftop. In contrast, they tend to use more negative words, such as water, book, air conditioning, pay, etc., which may indicate problems with the hotel's amenities or in-room facilities.
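The comparison of words used by groups and couples can be illustrated with a simple proportion shift, that is, the difference in a word's relative frequency between two sets of reviews. This is a minimal sketch of one basic form of word shift; the token lists are hypothetical toy data, the function name is illustrative, and the sketch does not claim to reproduce the exact computation behind the study's word-shift figures.

```python
from collections import Counter


def proportion_shift(tokens_a, tokens_b):
    """Relative frequency of each word in corpus B minus that in corpus A.

    Positive values mark words that are relatively more frequent in B,
    negative values words that are relatively more frequent in A.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocab = set(freq_a) | set(freq_b)
    # Counter returns 0 for words absent from one corpus.
    return {w: freq_b[w] / n_b - freq_a[w] / n_a for w in vocab}


# Hypothetical, already tokenised reviews for two traveller types.
couples = "staff location room room breakfast staff view quiet".split()
groups = "staff rooftop rooftop water location pay staff staff".split()

shift = proportion_shift(couples, groups)
ranked = sorted(shift.items(), key=lambda kv: kv[1], reverse=True)
for word, delta in ranked:
    print(f"{word:10s} {delta:+.3f}")
```

A proportion shift only shows which words are over-represented in one group; whether a word signals a positive or negative experience still has to come from the review's sentiment label or from a sentiment lexicon.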
In summary, online platforms represent a two-way channel for producing and consuming information and co-creating experiences. Our project analyses the usefulness of data mining techniques as an intelligent tool to enhance the image of hotel establishments and, by extension, of Andalusia as an international tourist destination. Guest reviews or comments expressed in natural language allow customers to describe their experiences with hotel services. In other words, our study does not address the analysis of quantitative variables collected by designing a structured questionnaire. Instead, it identifies performance issues that are subtle (and even hidden), challenging to diagnose and damaging to the hotel's reputation if not examined by consistent teams drawing on various disciplines. Using natural language processing techniques and topic extraction (unsupervised learning model), our analysis confirms with higher precision and richness of data the results of the published literature on the components contributing to improving relational quality levels. Therefore, our results should be even more reliable and valid than the statistical results discussed in work based solely on customer ratings (customer satisfaction) and perceptions obtained from satisfaction surveys on small samples of customers.

Finally, future research should further investigate the bias towards mostly positive reviews using a scale based, for example, on stars awarded. In addition, it should include a more significant number of guest descriptor variables, such as cultural script or demographic characteristics (e.g. age or income). Such variables may affect their stay evaluations. Research should also study other destinations with different personalities and backgrounds and seasonality patterns of tourists or tourist attractions to generalise the conclusions. Furthermore, Sainaghi and Baggio (2020) acknowledged that one of the main drawbacks is the complexity of differentiating the segments of business travellers on the one hand and leisure guests on the other. Future studies should focus on hotels located in the city centre and around transport hubs.
Accordingly, a significant limitation is also the potential for bias in big data. For example, data collected from online sources may be biased towards specific demographics, for instance, those who are more likely to be active on social networks. Therefore, data collection methods can introduce bias, such as self-selection or nonresponse bias, leading to inaccurate or unfair conclusions. In summary, it is essential to carefully consider the sources and methods of data collection and apply appropriate techniques to account for potential biases in the data, use a diverse set of data sources and be transparent about the limitations of the data to ensure that insights and conclusions are reliable and valid.
Figure 1. Example of positive and negative reviews on Booking.com

Figure 2. Distinctive terms in each review category (SATISF vs DISSAT) by frequency

Figure 5. Basic word shifts: types of travellers

Figure 6. Basic word shifts: length of stay

[...] visually shows the quality of the clusters or segments. The spatial distance between the data points serves as an initial indicator of similarity or dissimilarity in the original high-dimensional feature [...]

Figure 8. Two-dimensional graphical representation using UMAP
Note(s): * Class sizes have been balanced and intersected for analysis to avoid over-dimensioning the classes

Figure 9. Box plot of numerical variables by topic

Table 2. Source(s): Table created by the authors
Table 1. Note(s): 0: a positive review; 1: a negative review. Source(s): Table created by the authors
Confusion matrix. Source(s): Table created by the authors