Abstract
Contemporary consumer behavior is characterized by its multidimensionality and complexity, which, at the same time, pushes traditional segmentation approaches to their limits. In response, this methodological study proposes a multistage machine learning-based segmentation process using semiotic-semantic community detection. This innovative method was conducted exemplarily and evaluated on a representative sample of 1,101 German travelers. The main contribution of this study lies in the novel use of word vectors, which result from assigning a semiotic meaning to travel-type images. Thus, high-dimensional data could be used during the segmentation process, overcoming several classical segmentation problems. By using semantic similarities, tourists could be grouped and represented in their multidimensionality. From a theoretical perspective, this study was inspired by postmodern tourism practices in order to better understand the (oftentimes) hybrid and multilayered behaviors of tourists. To make this innovative approach reproducible, recommendations for implementation and all necessary data have been provided.
Introduction
The analysis of tourists’ behavior has always been, and continues to be, one of the most central topics in tourism research (Kozak, 2018). Scholars have thus widely agreed upon the fact that tourists are nowhere close to forming a homogeneous group (Albayrak & Caber, 2018; Dolnicar & Grün, 2008) as they have different needs, requirements, and expectations. As a result, since the mid-1970s, numerous attempts have been made to develop typologies (Bailey, 2003; E. Cohen, 1972), define tourist roles (Foo et al., 2004; Lowyck et al., 1992), create taxonomies (Mapingure & du Plessis, 2022; Shapley, 2018), and identify market segments (D’Urso et al., 2016; McKercher et al., 2023; Srihadi et al., 2016) that describe consistent subgroups. This enabled tourism products to be tailored according to a market segment’s characteristics, needs, and values (Dolnicar et al., 2012; Srihadi et al., 2016) while also simplifying tourism planning, marketing, and management (Dimitrios, 2022; Mariani et al., 2014). However, over the past years, with consumer behavior trends rapidly and continuously transforming (D’Urso et al., 2016), it has become increasingly difficult to apply segmentation procedures.
Social change, driven by new inequalities in income and wealth, rising leisure budgets, changing leisure and consumer behaviors, dynamic population developments, and reconfigured family and household structures (Egger, 2007), has promoted the individualization and pluralization of lifestyles (Al-Khanbashi, 2020), ultimately leading to postmodern traits (Deli-Gray & Árva, 2015) that are also reflected in travel behaviors (Dujmović & Vitasović, 2015; Uriely, 2005; Wright, 2022). From the mid-1990s onward, numerous papers have shifted their focus to tourism in the postmodern era, discussing, in particular, the changing behaviors of an increasingly complex and complicated tourist (A. A. Berger, 2011; Dujmović & Vitasović, 2015; D’Urso et al., 2014; Maoz & Bekerman, 2010; Postmodern Tourism-alternative Approaches, 2006; Zotic et al., 2014). This means that diametrically opposed behaviors can no longer be considered uncommon when traveling; for instance, tourists who enjoy the excellent food and extensive spa offerings of a five-star hotel may spend their next vacation in a self-catering cottage or at a camping site. Such hybrid consumers (Ehrnrooth & Gronroos, 2013) increasingly challenge the proper use of marketing activities.
In recent decades, an array of tourism studies has been devoted to the problem of segmentation and the creation of typologies and taxonomies, with the number of articles published each year confirming the irrevocable popularity of segmentation attempts. A study by Zins (2008) revealed that approximately 5% of tourism studies are related to the topic of market segmentation, and Dolnicar (2006) argues that the way segmentation approaches are carried out has remained fundamentally consistent over time. Moreover, though clustering approaches have dominated marketing and tourism literature (Dolnicar, 2008, 2020; Dolnicar & Lazarevski, 2009), D’Urso et al. (2016) question the use of classical algorithms—mainly because of the dissimilarities between modern and postmodern tourist behaviors—and push for new approaches.
While previous studies have mainly attempted to include sociodemographic, geographic, psychographic, or behavioral variables (Dolnicar et al., 2018; Johns & Gyimóthy, 2002), they ultimately result in quintessential classifications. This is problematic, however, as tourists no longer conform to clear-cut and predefined groups. At this point, it should be noted that the methodological approach proposed in this paper is not limited to segmenting tourists with postmodern characteristics; rather, placing a theoretical and empirical focus on the multi-optional behavior of hybrid tourists was done deliberately.
From a methodological standpoint, classical segmentation approaches often lead to problems in representing multidimensional, partly paradoxical, and contradictory consumer behavior, which is why innovative methods that are able to model the complexity of reality are urgently needed (Bigné & Decrop, 2019; Mehmetoglu, 2004). Since this is considered one of the greatest challenges of segmentation approaches, advantages of the presented methodology can be highlighted.
To summarize, the following methodological research questions were developed:
How can the multidimensionality of hybrid tourists best be modeled?
How can travel behavior be clustered in such a way that the similarity between extracted clusters and between individuals can be measured and the distances between them can be calculated?
Which methodological approach would meet both requirements?
By proposing a multistep machine learning-based process, in which respondents are clustered based on their travel preferences, the present study hopes to make a valuable methodological contribution. The following advantages can thus be noted: First, this approach uses images to assess the travel motives of the respondents, which makes it easier for them to express their needs and desires (Sertkan et al., 2020b). Second, tourist types usually coexist on an equal level as hardly any similarity measures reflecting realistic distances between identified types are available. In this study, however, and for the first time, a similarity measure is ensured by making use of the semiotic-semantic similarity generated from the annotations of the assessment images. Third, representing the concrete novelty of this approach, the use of weighted word vectors allows for all the information on the complexity of tourist behavior to be preserved in a high-dimensional vector space without being averaged out.
Addressing the theoretical side, this paper bridges customer segmentation with the theory of postmodernity and questions (1) to what extent individualization and pluralization tendencies play a role in today’s society and (2) whether the multi-optionality that exists renders the description and division of tourists into individual homogeneous groups obsolete. On that note, Dujmović and Vitasović (2015) have recognized a postmodern shift in theoretical tourism research, with scholars moving away from tourism typologies and focusing more so on the individual. E. Cohen (2019) and Boztug et al. (2015) also rightly wonder whether tourism typologies can even still be seen as relevant in a postmodern world where all the dichotomies of modernity (e.g., subject-object, self-world, analyze-experience, real-ideal) are gradually being broken down and disregarded.
The counterargument that follows, which will be thoroughly investigated in this paper, claims that tourists’ individually chosen paths eventually become bundled up in a “street of ants” (Keul & Kühberger, 1996). Consequently, individualization theory can no longer be understood as the heroization of the individual. In short, it is a matter of examining whether the theoretically infinite possibilities of differentiation truly are that diverse or whether they are ultimately finite and overarching multidimensional patterns can be identified.
As for the study itself, the described approach has been applied and evaluated based on 1,101 German travelers and has proven that numerous methodological problems in classical clustering approaches can be overcome. Hence, contemporary tourist behavior can in fact be represented in its complexity. A similarity measure that allows for the measuring of distances between the individual users and the extracted typologies will thus be introduced in this paper, and, owing to a network approach, an intrinsic evaluation metric will be provided. The results achieved make a valuable contribution to the theoretical discussion on contemporary tourism behavior and provide a better understanding of the current challenges of customer segmentation.
Literature Review
Theory of Postmodernity and Tourism
Tourism is constituted as a social phenomenon, and Hall (2004) emphasizes that tourism and its processes change in parallel with societal developments. A heated debate in recent decades has therefore surrounded the concept of postmodernism: a complex and multifaceted philosophical and cultural movement that emerged in the second half of the 20th century and has provoked fundamental changes in society and its thinking. Postmodernity, also called “fluid modernity” (Bauman, 2000) or “second modernity” (Beck, 2016), is a cultural paradigm (Thouki, 2019) and type of social consciousness (Yi et al., 2018) that is thoroughly diffuse and characterized by uncertainty (D. Wang et al., 2015). Notwithstanding the philosophical debate surrounding postmodernity and its validity and fruitfulness per se, it is nevertheless associated with numerous characteristics needed to understand this work and demonstrate the power of the semiotic-semantic community detection approach.
Modernity, as a precursor of postmodernity, is a historical epoch that developed throughout Europe in the 18th and 19th centuries and was defined by a variety of changes in the fields of economy, technology, art, and culture (Bauman, 2000; Habermas & Ben-Habib, 1981). Not only was modernism characterized by the rise of nations and industries but also by the spread of science and technology and the emergence of new art and cultural movements (Appelrouth & Edles, 2016). Furthermore, it embodied a belief in progress and rationality as well as the idea of it being possible to understand the world through objective analysis and reasoning (Kim et al., 2021). In contrast, postmodernism is not viewed as a historical era (Lyotard & Brugger, 2001); it is a complex social and cultural concept that has evolved over the years by critiquing the basic assumptions of modernity, especially the notions of progress, rationality, and knowledge. Postmodernism emphasizes the fragmentation and plurality of reality and refuses to accept a unified worldview (Taylor, 2003). This leads to an increased focus on the individual’s perspective and the importance of experience and interpretation (Arias & Acebrón, 2001). Denzin (1998) describes postmodern societies as more willing to compromise and follow a “both/and” logic rather than an “either/or” logic. Therefore, combining elements that initially seem incompatible, and are sometimes even contradictory, is inherent to postmodernism (Gao et al., 2022; Vester, 1999). For this reason, it is assumed that Western society in particular is too complex and too heterogeneous to be explained and described using traditional binaries (E. Cohen & Cohen, 2012).
Postmodern societies are consumer societies that encourage an eclectic lifestyle, and tourism can also be understood as a form of consumption (Sharpley, 2022), promoting multi-optional behaviors as tourists assemble and consume a treasury of experiences. According to Lash and Urry (1993), consumption patterns have changed from mass consumption to individualized buying habits, and Poon (1993) notes the technically supported, necessary change from standardized tourist products to individualized and personalized offers. In addition, S. A. Cohen and Cohen (2019) emphasize the need for a pluralizing conceptualization of tourism (Edensor, 1998) in order to do its heterogeneity justice. Note, however, that tourism research has only recently adopted the theoretical perspectives necessary for this.
In recent years, shifts in travel behavior, new forms of tourism, and a general structural change in the tourism industry could be observed (Mustonen, 2006). Drivers of such trends include globalization and digitalization, phenomena such as the climate crisis, the COVID-19 pandemic, wars, natural disasters, and the postmodern conceptualization of tourist experiences, amongst others (McCabe, 2005; Uriely, 2005; Urry & Jonas, 2011). Postmodernism is often used as a concept in tourism literature to describe the hybrid behavior of tourists (Boztug et al., 2015; D’Urso et al., 2016, 2019) or to characterize various forms of travel, including pilgrimage tourism (Collins-Kreiner, 2010; Thomas et al., 2018), voluntourism (Müller & Scheffer, 2019; Mustonen, 2006), digital nomad tourism (Mancinelli, 2020), or backpacking (E. Cohen, 2003). Topics such as authenticity (MacCannell, 1973; Yi et al., 2018; Zhu et al., 2023), multisensory experiences (Everett, 2009), social media usage (Jansson, 2018), smart tourism (Tribe & Mkono, 2017), and urban spatial development (Richards, 1996), to name just a few, are also widely disputed within the context of postmodernity. Finally, postmodernism is used to explain the increasing importance of sustainable tourism (A. A. Berger, 2011; Urry & Jonas, 2011) or to discuss its relationship to sightseeing and heritage tourism (Nuryanti, 1996; D. Wang et al., 2015).
According to McCabe (2002), in postmodern society, tourism has become a constant part of life routines and everyday culture; the importance of the individual has shifted, and attention has been alternatively drawn to the multilayered and multifaceted. The “new tourist” (Dujmović & Vitasović, 2015), also called “homo touristicus” (Chalmers, 2011; Herdin & Egger, 2018) or the “chameleon tourist” (Bigné & Decrop, 2019), shows postmodern (Dann, 2002; Urry & Jonas, 2011) and multi-optional (Hallmann et al., 2015) characteristics, does not adhere to social conventions, and consumes what he/she likes. For these consumers, hedonism replaces the search for the authentic (Ritzer & Liska, 2002). Whereas, formerly, travel motivations strictly defined one another (Popp, 2012), they now intermingle in a pastiche-like manner (Canavan, 2021; Vester, 1998), provoking tourists to switch erratically between different tourist experiences (Uriely, 1997)—even within the same trip (McCabe, 2010). For Bigné and Decrop (2019), today’s tourists pose a riddle to science and the tourism industry because they act unpredictably and irrationally, like an “omnivorous and insatiable” animal.
Concerning tourism marketing, several postmodern characteristics have caused upheavals and prompted necessary adjustments. Maclaran (2017) states that there has been “a sense that all things are disconnected,” accompanied by fragmentation (Firat & Shultz, 1997) and resulting in the decline of mass marketing as well as the creation of even smaller market segments. Reality has blended with the unreal or the hyperreal (Firat et al., 1995), producing new forms of experiences (J. Wang et al., 2022) in which passionate participation and co-creation (Egger et al., 2016) are foregrounded and passive consumption is displaced (E. Cohen, 2019; Dujmović & Vitasović, 2015). Lastly, postmodern pluralism has united all aspects into an “anything goes” syndrome (Maclaran, 2017), which, inspired by relativism, allows for multiple truths that are socially contrived.
For many years, the tourism and leisure industry has responded to altering patterns of demand by offering “individual” products as part of a seemingly unmanageable range of tourism products (O’Regan, 2014). Yet, while the challenge for marketing in the modern era was to tailor products and services to the individual, this task has become even more challenging in postmodern societies due to individuals no longer exhibiting unique behaviors (Boztug et al., 2015; D’Urso et al., 2016, 2019; Sibi et al., 2020). Moreover, the transformation from a seller’s to a buyer’s market has placed more emphasis on customers and fostered segmentation (Schlögl, 2017). Factors such as the spatial-temporal compression, the differentiation of tourist offers, and the variety and uniqueness of tourist experiences (Dujmović & Vitasović, 2015; Wearing et al., 2010) require a rejection of rigid typological frameworks and the introduction of new situational, motif-based, and more flexible segmentation approaches.
Vester (1998) has further validated the fact that such turmoil has already been taking place for many years, pointing out a paradox that is crucial for the continued argumentation of this study. The statement emphasizes that, on the one hand, tourists are viewed as multi-optional consumers; on the other hand, they ultimately always opt for the same thing and, therefore, regularly follow the same paths. Individualization is thus based on the belief in the unlimited possibility of choices. This is, however, essentially an illusion since actual choices can be reduced to a set of categories.
Tourist Typologies and Taxonomies
In principle, classifying members of a group into homogeneous subgroups can be referred to as segmentation (Dolnicar et al., 2018; D’Urso et al., 2019; Firat & Shultz, 1997; Marwijk & Taczanowska, 2006). With a conceptual approach, the grouping variables are generally known in advance (Dolnicar, 2002), and for any a priori segmentation, the characteristics that describe an individual must be defined beforehand by the researcher (Bigne et al., 2008; Boley & Nickerson, 2013; Mazanec, 2013). In contrast, data-driven segmentation approaches result in what is known as taxonomies (Dolnicar, 2008), a term that is often used interchangeably with typologies. These post hoc approaches are empirical in nature and fall back on quantitative techniques, especially cluster analysis (Dolnicar et al., 2018; D’Urso et al., 2021).
Typically, theory-based typologies find themselves at the focal point of a typological discussion, with empirical findings and modeling knowledge emerging as they continue to progress (Marwijk & Taczanowska, 2006). As a result, numerous typological attempts have been made over the years, with E. Cohen’s (1972) study (distinguishing between organized mass tourists, individual mass tourists, explorers, and drifters) often being chosen as a reference point. Main representatives in the 1990s and 2000s include, for instance, Crompton (1979), Uysal and Jurowski (1994), Fodness (1994), and Gretzel et al. (2004), amongst others. In most cases, typologies are diverse, ranging from bipolar distinctions, such as Gray’s (1970) “sunlust” versus “wanderlust,” to complex groupings with 30 different tourist roles, such as those in Jamrozy and Uysal (1994). What classic typologies have in common, though, is that they all attempt to group tourist experiences.
Contemporary approaches are seen as even more variable, pinpointing specific actions or aspects of tourism. This can be attributed to the advanced subject of this research field but also to the rise of postmodern behavior. For example, Fodness and Murray (1998) examined typologies on the basis of tourists’ information-seeking strategies, whereas Fan et al. (2017) based their tourism typologies on social contacts. Other studies have made attempts to classify a particular travel market or form of travel, as in T. E. Li and McKercher (2016), who examined typologies of diaspora tourism by grouping them according to cultural identity and place attachment, Wongkit and McKercher (2013), who created typologies for medical tourists, and Vong (2016), who studied types of cultural tourism in a gambling destination. Lastly, in their study, Gretzel et al. (2004) analyzed the efficacy and appropriateness of using personality traits to develop travel types.
On the other side of the spectrum, from a methodological perspective, empirical studies have mainly used cluster analysis, factor analysis, and latent class analysis to build stringent types of classifications (Baur & Blasiusk, 2014). Dolnicar et al. (2018) highlight the importance of method selection by stating, “segmentation methods shape the segmentation solution” (p. 75). In that regard, the debate surrounding the numerous methodological approaches, advantages and disadvantages, and points of criticism as well as the analyses of tourism application cases have been dealt with extensively in the literature (Dolnicar et al., 2018).
Nevertheless, since most traditional concepts are now considered unfitting and unsatisfactory, innovative segmentation approaches are necessary in order to conform to postmodern behavioral trends (Bigné & Decrop, 2019) and to continue to reach marketing goals (Firat & Shultz, 1997; Firat et al., 1995; Lyn & Smith, 2009). Boztug et al. (2015, p. 191) agree with this conclusion and argue, “if hybrid consumers exist in tourism and if they represent a substantial proportion of the market, the value of market segmentation as the strategic cornerstone of tourism marketing is in question.”
Machine Learning-Based Segmentation
Over the past years, a number of data-driven approaches promising specific advantages over traditional segmentation techniques have been introduced. With the rapid development of new algorithms, especially in the field of machine learning, even more possibilities have emerged, although they have been applied to tourism research only hesitantly (Egger, 2022a). A few exceptions, including self-organizing maps (SOMs), nevertheless have a long history in customer segmentation. Mazanec (1995), for instance, first applied SOMs to customer segmentation almost 30 years ago, and studies like those of Nilashi et al. (2019), segmenting customers based on online reviews, or Mattozo et al. (2022), segmenting the Brazilian tourism market, show that SOMs remain a promising approach to this day. Ustebay et al. (2020) point out, however, that SOMs are rooted in the trial-and-error method and require long training periods. Artificial Bee Colony (ABC) (Kuo & Zulvia, 2018), Particle Swarm Optimization (PSO) (Chan et al., 2016), and Genetic Algorithm (GA) (Schifferl, 1998) have therefore been adopted as alternatives for segmentation purposes (Ustebay et al., 2020).
Take the following as an illustration: Penagos-Londoño et al. (2021) segmented tourists based on perceived sustainability and trustworthiness, using GA to identify relevant features and latent class analysis (LCA) for clustering. Yet, Tsai and Chiu (2004) note that it is difficult to find an optimal solution using such a method. J. Li et al. (2020) also applied LCA to segment gamblers in a gambling destination, although multiple other studies adopted a two-stage approach instead. In Ceylan et al. (2021), Díaz-Pérez et al. (2020), and Supapakorn et al. (2022), for example, chi-square automatic interaction detection (CHAID) analysis was utilized as a decision tree algorithm to identify relevant features, followed by segmentation via multiple correspondence analysis (MCA). Despite these contributions, little interest in nonparametric models such as CHAID has been shown in the literature on tourism market segmentation (Díaz-Pérez et al., 2020).
To provide a better overview of generated typologies and the concepts underlying segmentation, Table 1 below lists a selection of typological attempts, divided into theoretical-conceptual, empirical, and modeling approaches.
Tourism Typologies.
Source. Extended and adapted version of Marwijk and Taczanowska (2006).
A major point relevant for this study is that almost all empirical segmentation attempts to date have ignored distance measures between segments. While efforts have indeed been made to identify coherent clusters, the distances between clusters have yet to be investigated and described. Questions such as whether family vacations are closer to the seaside or to camping grounds and whether luxury holidays are closer to relaxation and wellness or to shopping trips remain unanswered. In order to evaluate such similarities and differences, D’Urso et al. (2014) call for the linking of different types and roles that may better portray the multilayered tourist.
Methodology and Procedure
This study combines the theoretical lens of postmodernism with machine learning by proposing a multi-level segmentation approach that addresses the peculiarities of contemporary tourist behavior. As the research design is quite complex, an outline of the entire process is depicted in advance in Figure 1 so that the basic idea is comprehensible before the details of the individual steps are introduced. Additionally, an explanatory video has been made available, which can be accessed either via the QR code in Figure 1 or via the link https://tinyurl.com/vectorize-me. This was created in order to make the proposed method more accessible to those readers who have little to no experience with Natural Language Processing (NLP) and machine learning.

Research design.
For this study, the travel types proposed by Gretzel et al. (2004) served as a frame of reference. As will become clear, these basic types only formed the starting point, and any blurring, for example, due to overlaps or missing types, was compensated for later. For each travel type, different photos were selected and then annotated. Thus, a textual description of each photo was created, corresponding to an intersubjective attribution of meaning in the semiotic sense. After this step, a document (the sum of the annotations of a photo) that describes the photo in many ways was available for each photo. In order to be able to calculate with these texts, they first had to be converted into high-dimensional vectors, which is a typical process in NLP. As a result, each image had its own image vector. A sample of 1,101 German tourists then performed a self-assessment based on these photos and evaluated the extent to which the pictures corresponded to their way of traveling. At this point, two values were accessible: the image vector, which emerged from the embedding of the annotations, and the value of the self-assessment by the users. Multiplying the two values and summing the products over all images resulted in a user vector, which corresponded to the image vectors weighted by the self-assessment. The set of all user vectors could now be used as the basis for the community detection procedure.
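The weighting step described above can be illustrated with a minimal Python sketch. It assumes toy three-dimensional image vectors and a hypothetical respondent; the identifier CIT1, the vector values, and the function name `user_vector` are invented for illustration and are not taken from the study's published code.

```python
def user_vector(ratings, image_vectors):
    """Weight each image vector by the user's rating and sum over all images."""
    dim = len(next(iter(image_vectors.values())))
    total = [0.0] * dim
    for image_id, rating in ratings.items():
        vec = image_vectors[image_id]
        for i in range(dim):
            total[i] += rating * vec[i]
    return total


# Toy 3-dimensional image vectors (assumption: real embeddings are high-dimensional).
image_vectors = {
    "BEA1": [1.0, 0.0, 0.5],
    "CIT1": [0.0, 1.0, 0.5],  # hypothetical image identifier
}

# Hypothetical seven-point Likert self-assessment of one respondent.
ratings = {"BEA1": 7, "CIT1": 2}

u = user_vector(ratings, image_vectors)
print(u)  # [7.0, 2.0, 4.5]
```

Because the ratings act as weights rather than being averaged away, a respondent who rates contrasting images highly keeps both signals in the resulting vector.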
Since text vectors can represent semantic similarity in high-dimensional spaces, the similarities between the individual vector types (image, user, cluster) could be determined, and measurable (semantic) distances were available for further analysis. This was followed by the evaluation of the clustering and the interpretation of the data. At the end of the method description, Table 2 lists how the semiotic-semantic community detection approach differs from classical segmentation approaches and which advantages and disadvantages arise from it.
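As a hedged illustration of how such semantic distances might be computed, the following Python sketch uses cosine similarity. The choice of cosine similarity is an assumption made here for illustration (it is a common measure for text embeddings); the passage above does not name the measure used, and all vector values are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors standing in for two user vectors and one cluster vector.
user_a = [7.0, 2.0, 4.5]
user_b = [6.0, 1.0, 4.0]
cluster = [0.0, 5.0, 1.0]

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, cluster)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Since image, user, and cluster vectors live in the same space, the same function can compare any pair of them, which is what makes distances between users and extracted types measurable.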
Comparing Traditional and Semiotic-Semantic Segmentation Approaches.
Figure 1 provides an overview of the proposed method, entitled “a multi-stage machine learning-based segmentation process using semiotic-semantic community detection.”
The following sections will now present a more detailed step-by-step run-through of the semiotic-semantic community detection. The entire approach encompasses 10 individual steps. However, if one wishes to apply this approach to one’s own data, then the first six steps can be disregarded since all necessary documents, such as the photos, their vector representations, and the corresponding Python code, can be downloaded.
Basis Typologies, Image Selection, and Rating
The present study focuses on the use of images for assessing travel preferences. Therefore, the objective of the first step was to select proper photos—based on existing typologies—for the self-assessment stage. This study’s research design adopted its theoretical foundation from the typologies established in Gretzel et al. (2004); however, their original 12 travel personalities (All Arounder, Avid Athlete, Beach Bum, Boater, City Slicker, Culture Creature, Family Guy, Gamer, History Buff, Shopping Shark, Sight Seeker, Trail Trekker) were revised and adapted so that 16 basic typologies (Action Beast, Beach Bum, Camper Tramper, City Slicker, Culture Creature, Entertainment Junkie, Ethno Traveler, Family Guy, History Buff, Luxury Chap, Nature Lover, Relaxing Character, Shopping Shark, Sight Seeker, Spiritual Enthusiast, Vivid Athlete) could ultimately be defined as the baseline for this segmentation approach. This adjustment was necessary because Gretzel et al.'s (2004) typology was geared more toward US citizens and their travel behavior. Nonetheless, the use of their proposed travel personalities in studies such as Jani et al. (2014), Mitsche (2016), Park et al. (2010), and Pérez-Tapia et al. (2022) demonstrates their wide acceptance across the scientific community. Countless other typologies could have also been used, but as one will come to realize later, some blurring did not matter at this stage.
For each travel type, six images that best embodied and characterized that category were sought. A total of 96 images (six images for each of the 16 tourist types) were then presented to a convenience sample of 100 people, who made the final selection regarding which three images best fit each separate typology. After conducting a chi-square test, 48 images were determined as the most suitable representatives.
Thereafter, two different samples were used for the next two steps, whereby the second sample was only used in the subsequent method step and is thus presented there. Recruited in September of 2021 using an online panel, the first cleaned sample consisted of 1,101 Germans who had taken at least one vacation trip in 2019 with a minimum of four overnight stays. Quotas representative of Germany with regard to gender, age, education, and federal state were taken into account; the sample thus describes outbound travel.
As shown in Table 3, the age of the respondents ranged from 18 to 75 years, with an average age of 38.8, and the proportions of women (50.9%) and men (49.1%) were almost equal. Approximately 37% stated that they were single, while 58% were either married or in a relationship. Regarding education level, 42% possessed a secondary school degree, almost 29% had obtained a high school diploma, 5.5% had failed to complete their studies, and 22% had a university degree. When it came to household income, roughly two-thirds claimed to get by very well, well, or relatively problem-free, whereas one-third claimed to have difficulty with the amount of money available per month. Just over 13% said they had no experience with long-distance travel, with just under 28% having little experience, just over 35% having some experience, and approximately 23% having abundant experience with long-haul trips.
Sample Characteristics.
Participants were asked to self-assess, by means of a seven-point Likert scale, whether the presented images were in line with their type of traveling. In addition, questions were asked about their sociodemographic characteristics, psychographics, and travel behavior. A similar approach has previously been used in studies such as Neidhardt et al. (2015), Sertkan et al. (2019), and H. Berger et al. (2007).
Image Annotation and Vectorization
4. From here on out, the approach presented differs significantly from classical segmentation attempts. Traditional approaches would typically perform clustering or a similar procedure on the questionnaire responses. The semiotic-semantic method, however, aims to avoid processing these “raw data” and, instead, adds an extra step that removes the fuzziness from the initially-defined typologies, preserves maximum user data, and also introduces a semantic distance metric, which will be explained next.
To achieve this, the approach draws on the advantages of NLP by using the semiotic interpretation of the annotated images. Yet, to obtain the semiotic meaning attribution for each image and to take advantage of the semantic distance measures, all 48 images had to be annotated first. For this purpose, a second sample consisting of 622 international participants was used. This time, participants were required to assign two annotations to each image using the following format: [“traveltype-description” trip] (e.g., [Family Trip], [Beach Trip]). While the first annotation could be relatively unambiguous, the second annotation was supposed to specifically exemplify the interpretive diversity of the image. For instance, the image referred to as BEA1 (see the photo marked by a red * in Figure 3) in the “Beach & Family” category was annotated with the “traveltype-descriptions” listed in Table 4.
Top 15 Terms describing the BEA1 Image.
The findings from the annotative step revealed considerable variation for all the images, with an average of 210.3 unique annotations. Each term was weighted by the frequency of how many times it was mentioned, and the sum of the weighted texts (per photo) was considered the “descriptive document” of the image. The image interpretations annotated by 622 people thus generated 48 weighted text documents, one for each image.
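The frequency weighting described above can be illustrated with a minimal Python sketch. The exact weighting scheme is not specified beyond mention frequency, so repeating each term as often as it was mentioned is an assumption; the function name `build_descriptive_document` and the sample annotations are hypothetical and do not come from the study's data.

```python
from collections import Counter


def build_descriptive_document(annotations):
    """Collapse the raw annotations of one image into frequency-weighted terms."""
    # Normalize case and whitespace so "Beach Trip" and "beach trip " count as one term.
    counts = Counter(a.strip().lower() for a in annotations if a.strip())
    # Assumed weighting: repeat each term as often as it was mentioned.
    tokens = []
    for term, freq in counts.most_common():
        tokens.extend([term] * freq)
    return counts, tokens


# Hypothetical annotations for a single image (illustrative only).
sample_annotations = ["Beach Trip", "Family Trip", "beach trip", "Summer Trip", "Beach Trip"]
weights, document = build_descriptive_document(sample_annotations)
print(weights.most_common(1))
print(len(document))
```

The resulting token list could then be passed to a document-embedding model as the image's "descriptive document."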
5. In order to process textual data further, it was first converted into numerical values because machine learning algorithms require a fixed-length vector as the input value (Egger, 2022c). The procedure of text representation is well known in the field of NLP, and with the help of pre-trained language models, semantic similarities can be expressed using distance measures. For this purpose, the language models embed text in a vector space. This vector space can have any number of dimensions, but 100 dimensions have been proven to be suitable for such tasks (Walkowiak et al., 2019; Zhang et al., 2019). As a result, terms such as “ski” and “snow” have greater proximity to each other than, say, the terms “ski” and “mountain.”
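To make the notion of semantic proximity concrete, the following sketch computes cosine similarity between embedded terms. The three-dimensional vectors are invented for illustration (real embeddings in this study have 100 dimensions), and the function name `cosine_similarity` is a hypothetical helper rather than part of the study's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy embeddings, made up for illustration only.
toy_embeddings = {
    "ski": [0.9, 0.1, 0.2],
    "snow": [0.8, 0.2, 0.3],
    "beach": [0.1, 0.9, 0.4],
}

sim_ski_snow = cosine_similarity(toy_embeddings["ski"], toy_embeddings["snow"])
sim_ski_beach = cosine_similarity(toy_embeddings["ski"], toy_embeddings["beach"])
print(sim_ski_snow > sim_ski_beach)
```

A value close to 1 indicates that two terms point in nearly the same direction in the vector space, whereas a value near 0 indicates little semantic overlap.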
Next, a suitable language model was required to embed the annotations. There are many language models on the market, but two in particular, Doc2Vec and BERT, were tested as they were considered most ideal. Arefieva and Egger (2022) point out that, when it comes to training language models, the use of domain-specific texts is of the utmost importance for producing high-quality results. For this reason, Doc2Vec and BERT were trained from scratch with a tourism-specific training corpus. A 4 GB-large training corpus, reflecting tourism’s linguistic diversity and intercultural dimension, was built by crawling a total of 3.6 million hotel, destination, attraction, and sightseeing reviews as well as approximately 40k attraction and destination descriptions from 20 different countries.
6. The results revealed that the less advanced Doc2Vec model outperformed the state-of-the-art BERT model since the annotations simply involve a pure list of strings without any context and/or complex meanings. This is in line with other studies, such as Edminston and Park (2020) and the case of FinBERT (a pre-trained NLP model used to analyze the sentiment of financial texts) (Devlin et al., 2019), which have also determined that, for such downstream tasks, Doc2Vec can in fact outperform state-of-the-art language models like BERT—regardless of whether or not it has been trained on a domain-specific corpus.
Two major advantages emerged from this approach: First, by using high-dimensional text vectors, all the information could be preserved; second, the word and document vectors were able to map semantic similarity in the high-dimensional vector space. Figure 2 thus shows three images, deriving from three different basic typologies, namely, “Nature Lover,” “Family Guy,” and “Camper Tramper,” in the vector space below (after using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) to reduce the 100 dimensions down to only three). Although they are diverse, and originally stem from three different travel types (see Figure 3), these images appear close to each other due to their semantically similar image interpretations. This is why, as mentioned at the beginning, the basic typologies only served as a frame of reference to be adapted later on. As a general result, exact and measurable distances between the image vectors could be established and used for further calculations. Due to the high dimensionality of the data generated in this way, the multi-optionality of hybrid customers, and thus their contemporary consumer behavior, could be better preserved and mapped in a more stable way.

Vectorized Images visualized in TensorBoard’s Embedding Projector.

Explorative Factor Analysis of the Images.
Vectorizing and Clustering the User
7. To create a user vector that indicates to what extent a person identifies with each travel type and how similar individual users are, each image vector was multiplied by the user’s rating of that image (the extent to which the image corresponded to the person’s travel preferences). This resulted in user-specific weighted image vectors. Then, a user’s 48 weighted image vectors were summed to obtain the user vector itself. All the information is stored in this vector: It reflects not only the user’s evaluation of each image but also the position of each image relative to all the other images in the vector space. Hence, all 1,101 users could be mapped in a 100-dimensional vector space. Here, the same applies as with texts or annotated images: The more similar the user vectors are, the closer they are in the vector space and the more similar their travel behavior is. In future applications, since image vectors need to be created only once, survey participants need only rate the 48 images; their user vector is then created from these ratings, rendering the similarity with other users in high-dimensional spaces measurable. When reproducing this approach using the same images, the corresponding image vectors may be downloaded and utilized.
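The construction of a user vector as a rating-weighted sum of image vectors can be sketched as follows. The example uses three two-dimensional toy vectors instead of the 48 images and 100 dimensions of the study, and the function name `user_vector` is hypothetical; plain Python lists are used only to keep the sketch self-contained.

```python
def user_vector(ratings, image_vectors):
    """Sum the image vectors, each weighted by the user's rating of that image."""
    if len(ratings) != len(image_vectors):
        raise ValueError("one rating per image vector is required")
    dim = len(image_vectors[0])
    total = [0.0] * dim
    for rating, vec in zip(ratings, image_vectors):
        for k in range(dim):
            total[k] += rating * vec[k]
    return total


# Toy data: three images embedded in two dimensions (illustrative values).
toy_image_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_ratings = [7, 1, 4]  # seven-point Likert ratings of one hypothetical user

vec = user_vector(toy_ratings, toy_image_vectors)
print(vec)  # [9.0, 3.0]
```

Because the image vectors are fixed, a new respondent's vector follows directly from the ratings, without re-embedding any text.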
8. In the next step, the Louvain algorithm (Blondel et al., 2008), a particularly robust graph-based clustering algorithm for high-dimensional data (S. Z. Li et al., 2020), was applied for the purpose of community detection. Louvain is based on modularity optimization and tries to detect communities by assuming, in the first iteration stage, that each node in the network represents a community of its own (Blondel et al., 2008). Subsequently, the change in modularity is calculated when node i is removed from its community and inserted into that of its neighboring node j. If no gain in modularity occurs, the node remains in its current community (Sun, 2018). Both steps are performed repeatedly until modularity can no longer be increased through the unification of communities (Vatsal, 2022). This provides a big advantage over classical clustering algorithms because the number of clusters is determined by the algorithm and no human assessment is needed.
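The node-moving logic described above can be sketched in plain Python. This is a simplified illustration and not the implementation used in this study: It covers only the first phase (moving single nodes between communities) and omits the subsequent aggregation of communities into super-nodes. The function name `louvain_local_moving` and the toy graph are hypothetical.

```python
def louvain_local_moving(adj):
    """First Louvain phase on a weighted, undirected graph without self-loops.

    adj maps each node to a dict {neighbor: weight}; returns {node: community}.
    """
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    two_m = sum(degree.values())        # twice the total edge weight
    community = {n: n for n in adj}     # every node starts as its own community
    sigma_tot = dict(degree)            # summed degree per community
    improved = True
    while improved:
        improved = False
        for node in adj:
            current = community[node]
            # Edge weight from this node to each neighboring community.
            links = {}
            for nbr, w in adj[node].items():
                c = community[nbr]
                links[c] = links.get(c, 0.0) + w
            # Temporarily remove the node from its community.
            sigma_tot[current] -= degree[node]
            # Score proportional to the modularity gain: k_in - sigma_tot * k_i / (2m).
            best = current
            best_gain = links.get(current, 0.0) - sigma_tot[current] * degree[node] / two_m
            for c, k_in in links.items():
                gain = k_in - sigma_tot[c] * degree[node] / two_m
                if gain > best_gain:
                    best, best_gain = c, gain
            sigma_tot[best] += degree[node]
            if best != current:
                community[node] = best
                improved = True
    return community


# Toy graph: two triangles joined by a single bridge edge (2-3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_adj = {}
for u, v in toy_edges:
    toy_adj.setdefault(u, {})[v] = 1.0
    toy_adj.setdefault(v, {})[u] = 1.0

partition = louvain_local_moving(toy_adj)
print(partition)
```

On this toy graph the two triangles end up as separate communities, without the number of communities having been specified in advance.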
For the present study, since community detection is not based on the features of the data, but rather the relationships between the data (Egger, 2022b), the first step was to normalize the self-assessment values collected from the 1,101 German participants and to create a network graph. Dolnicar (2002) mentions that high-dimensional data usually complicates the clustering task and fewer variables are generally better; however, thanks to the Louvain algorithm, all the information stored in the document embeddings could be clustered effortlessly. As Euclidean and Manhattan distances tend to break down in high-dimensional spaces, the cosine distance is also suggested (Egger, 2022c).
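How the network graph is built from the vectors is not detailed here, so the following sketch rests on an assumption: Each user is linked to its k most similar peers, with cosine similarity as the edge weight. The function names `cosine_similarity` and `knn_similarity_graph`, the parameter `k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_similarity_graph(vectors, k=1):
    """Link every vector to its k most similar peers; weights are cosine similarities."""
    n = len(vectors)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(cosine_similarity(vectors[i], vectors[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        for sim, j in sims[:k]:
            if sim > 0:  # keep only positively similar pairs
                adj[i][j] = sim
                adj[j][i] = sim  # store the edge in both directions (undirected graph)
    return adj


# Toy user vectors forming two visibly distinct groups (illustrative values).
toy_users = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
graph = knn_similarity_graph(toy_users, k=1)
print(graph)
```

The resulting adjacency structure is the kind of input a community detection algorithm operates on, since it encodes relationships between users rather than their raw features.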
9. After the clustering of the user vectors, a dimensionality reduction must be performed for visualization purposes in which the 100-dimensional space is brought down to a three- or two-dimensional one. UMAP was selected for this step as it is well known for its ability to preserve the global structure of data (Oskolkov, 2022).
Cluster Evaluation
10. Because community detection is an unsupervised method, no ground truth exists for evaluating the clusters, which means that extrinsic measures are ineffective and only intrinsic measures can be implemented. Within the context of community detection, modularity is employed as a quality measure for the partition (Labatut, 2015; Lancichinetti et al., 2010), typically ranging from −1 to 1, with 1 theoretically representing the best value (Singh & Garg, 2020). This study achieved a modularity score of 0.667, indicating a very good separation into clusters, and identified 14 distinct clusters. In cases where Louvain suggests too many clusters, or clusters with too few community members, hyperparameters should be tuned or a hierarchical agglomeration of clusters, as suggested by Ozaki et al. (2017), can be carried out. The next, and final, step is the same as with conventional clustering procedures: The communities are compared by means of ANOVA with all the variables that were collected about the user and that are of interest.
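For readers unfamiliar with modularity, the following sketch computes it for a given partition of an unweighted, undirected graph, using the standard definition (the share of edges inside each community minus the share expected from the community's total degree). The function name `modularity` and the toy graph are hypothetical and unrelated to the study's data.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph.

    edges is a list of (u, v) pairs; community maps each node to a label.
    """
    m = len(edges)
    internal = {}    # number of edges inside each community
    degree_sum = {}  # summed node degree of each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items())


# Toy graph: two triangles joined by one bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "all" for n in range(6)}

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(round(q_two, 3), round(q_one, 3))  # 0.357 0.0
```

Splitting the toy graph into its two triangles scores higher than placing all nodes in one community, which is the sense in which a higher modularity indicates a better-separated partition.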
At this stage, in order to avoid interpreting the expressions of all 48 images, an explorative factor analysis with oblimin rotation was performed based on the user ratings of the images. This grouped the 48 images into seven factors that could be interpreted more easily. The full sample (n = 1,101) formed the data basis, and the KMO criterion yielded an overall MSA of 0.929. Figure 3 presents the 44 images that could clearly be assigned to one of the seven extracted factors, while the four remaining images could not be assigned to any factor.
Table 2 shows the main differences between the semiotic-semantic community detection approach and classical segmentation techniques and highlights its advantages and disadvantages.
Results
The results presented here refer mainly to the methodological process applied to the representative sample of the 1,101 German travelers. Regarding segmentation, past attempts have consistently been criticized because measures of similarity between tourist types have been lacking. To fill this research gap, vectors were created for the previously extracted factors, and, after normalization, Euclidean distances were calculated. Each factor vector is the sum of the individual image vectors of those images that form the factor. As previously mentioned, the factors were merely used to obtain a better and more precise interpretation, thereby enabling seven factor-vectors to be interpreted and represented in place of the 48 individual images. Table 5 thus shows how similar these factors are. Note, as an example, from the inter-distance-map in Table 5, that the maximum distance possible can be observed between “Luxury” and “Beach & Family,” whereas “Ethno & Culture” seems to be in proximity to “Attractions & City.” For the first time, the perceived similarity of travel types can be represented using a fixed distance measure.
Travel Typology Factors Inter-Distance-Map.
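The computation behind such an inter-distance map can be sketched as follows: Image vectors belonging to a factor are summed, the resulting factor vectors are normalized, and pairwise Euclidean distances are calculated. The type of normalization is not specified in the text, so scaling to unit length is an assumption here; the function names and the two toy factors are hypothetical.

```python
import math


def factor_vector(image_vectors):
    """Sum the image vectors of one factor and scale the result to unit length (assumed)."""
    dim = len(image_vectors[0])
    total = [sum(vec[k] for vec in image_vectors) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in total]


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy factors in two dimensions (illustrative values, not the study's vectors).
toy_factors = {
    "Factor A": factor_vector([[1.0, 0.0], [1.0, 0.0]]),
    "Factor B": factor_vector([[0.0, 3.0]]),
}

dist_ab = euclidean_distance(toy_factors["Factor A"], toy_factors["Factor B"])
print(round(dist_ab, 4))  # 1.4142
```

Repeating the distance calculation for every pair of factors yields a symmetric matrix of the kind reported in the inter-distance map.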
To further characterize each cluster, the mean values of all the image ratings were calculated for each factor per cluster prior to being normalized for better visualization. Accordingly, the radar plot in Figure 4 displays the 14 individual clusters, with the relative size and the factor ratings of each cluster, extracted from the German participants. Without a doubt, the provided cluster solutions demonstrate postmodern behavior, characterized by a “both/and” instead of an “either/or” mentality (Denzin, 1998).

Cluster Descriptions.
Although postmodern segments are considered quite fluid rather than rigid, the following provides a brief outline of the 14 types of German travelers. Cluster 1, with 12.45%, embodies the largest community and contains members who exhibit universal behavior. Except for the factors of “Luxury” and “Beach & Family,” all the other levels basically reach maximum value. This looks very different in cluster 2 (12%): Here, they see themselves as active travelers with a tendency to fall under the “Beach & Family” category, whereby the “Luxury” component is also of great importance. Cluster 3, contrarily, could be described as alternative minimalists. This cluster, which includes approximately 8% of German travelers, is most likely to be identified using the factors “Nature,” “Ethno & Culture,” and “Camping.” Members of cluster 4 (7.82%) can do without luxury and active vacations, preferring, above all, attractions and cities. At the same time, they are also open to “Nature” experiences and “Ethno & Culture.” Cluster 5, with a size of 7.55%, is also characterized by passivity and presents its maximum value in the areas of “Beach & Family” and “Luxury.” The latter is a nice example of contrasting postmodern behavior since, as exemplified in Table 5, these two areas have maximum distance from each other.
Represented in cluster 6 (7.45%) are typical all-rounders, displaying average values in all areas. Similar to cluster 5, but demonstrating more extreme characteristics, are travelers in cluster 7 (7.27%)—they polarize between “Beach & Family” and “Luxury.” Camping is a no-go for them, but this group is also highly passive. Cluster 8, on the other hand, portrays a stark contrast to the previous description, completely renouncing “Luxury” and mainly abandoning “Beach & Family.” Instead, with a size of 6.73%, “Ethno & Culture” as well as “Attractions & City” are of greater significance to these cluster members. Regarding cluster 9 (6.55%), this group forgoes “Ethno & Culture” and “Attractions & City” altogether but defines “Beach & Family” as its most critical factor. All other factors have little to no relevance. Structurally similar, only with stronger characteristics, is cluster 10 (5.64%); here, both “Camping” and “Luxury” are comparatively more important.
Cluster 11 can be described as a group that largely refrains from active travel behavior and is only moderately open to camping, while all other factors in this 5.55% community are considered highly relevant. With maximum values for the factors of “Luxury” and “Beach & Family,” but high values for “Attractions & City” as well, the group in cluster 12 (5.50%) is also very divided in its behavior. In contrast, for members of cluster 13 (4.55%), the “Beach & Family” factor appears to have hardly any relevance, and all the other factors only pose a medium level of relevance. Turning to the last and smallest group, cluster 14 (3.55%) can ultimately be expressed as camping and nature-oriented tourists, who have little interest in the “Luxury” or “Beach & Family” categories.
While classical segmentation approaches primarily use sociodemographic, psychographic, behavioral, or geographic characteristics (Dolnicar et al., 2018) to classify people into groups, this study utilizes user vectors as input variables. As output variables, all characteristics of the cluster members can be analyzed, though images or image factors are also suitable as output for interpretation since the tautological problem was significantly mitigated through multiplication with the image vectors. Thus, to determine any differences between the clusters in terms of age, travel frequency, education, and federal state, and to be able to better describe them, several one-way ANOVAs and a chi-square test were conducted.
Regarding the test assumptions for age, Levene’s test was significant (p = .001), indicating that the assumption of homogeneity of variance was violated; normality was checked with a Q-Q plot, in which no deviations were noted. A Kruskal-Wallis test was therefore performed and showed a significant difference between the 14 clusters: H(13, 1,087) = 80.26, p ≤ .001. Subsequently, Dunn’s post hoc test for non-parametric data was performed; the results can be observed in Table 6 below, together with the significant Bonferroni- and Holm-corrected values. 1
Dunn’s Post Hoc Comparisons: Cluster/Age.
*p < .05. **p < .01. ***p < .001.
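As a reminder of what the Kruskal-Wallis statistic measures, the following sketch computes H from rank sums, including the usual correction for ties. It is a didactic illustration with made-up values, not the procedure used to produce the results above; the function name `kruskal_wallis_h` is hypothetical, and the p-value (which requires the chi-square distribution) is deliberately left out.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction for a list of samples."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n_total:
        # Find the run of equal values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12 / (n_total * (n_total + 1)) * sum(
        r * r / len(values) for r, values in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


# Made-up ages for three hypothetical clusters.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_wallis_h(toy_groups)
print(round(h_stat, 2))  # 7.2
```

Because the statistic is based on ranks rather than raw values, it remains applicable when the homogeneity-of-variance assumption of ANOVA is violated.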
To illustrate just two results: Individuals in cluster 2, who are characterized by active travel behavior and a tendency to lean toward luxury, are significantly (pTukey ≤ .001) younger (M = 31.58, SE = 1.235) than travelers from cluster 4 (M = 44.41, SE = 1.530). The latter group attaches less importance to both “Luxury” and “Active” travel and values the “Attractions & City” factor more. On the other hand, German travelers in cluster 11, who especially appreciate “Ethno & Culture,” “Luxury,” and “Attractions & City,” but less “Activity,” are on average 45.79 years old (SE = 1.816), which is 14.2 years older (pTukey < .001) than the members of cluster 2.
Shifting to education, as with the age variable, Levene’s test was significant (p = .001), and normality was checked using a Q-Q plot. Therefore, a Kruskal-Wallis test was performed once again, indicating a significant difference between the 14 clusters: H(13, 1,087) = 52.25, p ≤ .001. Table 7 shows the significant Bonferroni and Holm’s corrections, while the complete table with all significant Dunn’s values can be downloaded via the link provided in the footnotes.
Dunn’s Post Hoc Comparisons: Cluster/Education.
*p < .05. **p < .01. ***p < .001.
When examining the characteristics of the clusters with respect to the level of education, one can note, for example, that travelers in cluster 1, who exhibit “normal” travel behaviors, have a significantly higher level of education than members from cluster 10, who display little interest in “Ethno & Culture” and “Attractions & City” but present a high level of enthusiasm for “Beach & Family.” Interestingly, travelers in the smallest cluster (14), who value “Camping” above all and attach little significance to “Luxury,” possess the highest education level.
To determine any differences between the clusters in terms of experience with long-distance travel, an ANOVA was performed; this time, Levene’s test was not significant (p = .337). Normality was checked with a Q-Q plot, with no deviations being noted. Overall, there was indeed a significant difference between the 14 clusters: F(13, 1,087) = 6.229, p < .001, η² = .069. Those comparisons, along with a significant Tukey correction, can be viewed in Table 8.
Post Hoc Comparisons: Cluster/Experience with Long-distance Travel.
*p < .05. **p < .01. ***p < .001.
Note. Cohen’s d does not correct for multiple comparisons. p-Value adjusted to compare a family of 14.
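As a plausibility check on the effect size reported for the ANOVA on long-distance travel experience, eta squared in a one-way design can be recovered from the F statistic and its degrees of freedom. The derivation below uses only the reported values F(13, 1,087) = 6.229 and the standard identity between F and the sums of squares; it is a worked illustration, not an additional analysis.

```latex
\eta^2
  = \frac{SS_{\text{between}}}{SS_{\text{total}}}
  = \frac{F \cdot df_{\text{between}}}{F \cdot df_{\text{between}} + df_{\text{within}}}
  = \frac{6.229 \times 13}{6.229 \times 13 + 1087}
  = \frac{80.977}{1167.977}
  \approx .069
```

The result agrees with the reported value of η² = .069, meaning that cluster membership accounts for roughly 7% of the variance in long-distance travel experience.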
A glance at the variable of experience with long-distance travel between the clusters shows, for instance, that individuals from cluster 5, who primarily appreciate “Luxury” and “Beach & Family,” have significantly (p = .001) more experience with long-distance destinations than members from cluster 9, for whom “Beach & Family” is significantly more important than any of the other factors. After taking a closer look into the preferred travel destinations of each cluster, these results could be validated further. Whereas members from cluster 9 prefer to avoid traveling too far (Germany, Italy, and Spain account for 35.38% of all mentioned destinations), members from cluster 5 favor a larger variety and more exotic destinations (here, Germany, Italy, and Spain account for only 23.34%).
Lastly, in order to explore whether cluster membership is related to the tourists’ federal state of origin, a chi-square test of independence was performed, which, however, revealed no significant findings (p = .254). German travelers thus do not exhibit significantly different travel behavior across federal states. This is particularly interesting as it suggests that travel behavior in the former eastern German states has converged with that of the old federal states.
As previously mentioned, the objective of this study lies in its methodological approach; therefore, including several detailed and meaningful analyses of the German travel market would go beyond the scope of this paper. Nevertheless, these results exemplify what types of substantive conclusions can be drawn when additional variables are included.
Discussion and Conclusion
Overall, this paper suggests a new methodological approach that considers the multi-optionality of tourists while also providing a novel measure to represent the distances of travel types, tourist clusters, and individual tourists in relation to each other. Hence, the presented method uncovers and allows for deeper, unprecedented insights into the complexity of tourist typologies as well as their overarching structure. For instance, recent studies have suggested that certain forms of travel have risen in popularity due to the COVID-19 pandemic (Neuburger & Egger, 2021; Rogerson & Rogerson, 2021). More simple, regional, and nature-conscious forms of vacation (Vaishar & Šťastná, 2022), such as camping (Ananda & Novianti, 2021; Uglis et al., 2022), agritourism (Chin & Pehin Dato Musa, 2021), or fitness-orientated tourism (Zhong et al., 2021), have gained importance and now complement many tourists’ previous forms of travel.
The present study makes an important contribution to theory in several respects. Principally, the goal of this study was to answer the initial, critical question of whether segmenting contemporary tourists could even be possible or if the tendency toward individualization had already progressed to such an extent that it could no longer be possible to identify any overarching patterns (thus rendering segmentation for marketing purposes pointless). First of all, the results revealed postmodern behavior(s) consistent with those described in the literature. The characteristics of the individual clusters confirmed that an “anything goes” (Maclaran, 2017) mentality is pervasive, blurring all existing categories and classifications. Nonetheless, this study also clearly revealed that, despite the omnipresent multi-optionality and even sometimes paradoxical individual travel behavior, overarching patterns can in fact be recognized and tourists can still be divided into homogeneous groups. In this respect, the hypothesis that the seemingly infinite possibilities for differentiating hybrid tourist behavior are ultimately finite can be confirmed.
Furthermore, this study contributes to the methodological discussion by proposing a multi-stage segmentation process that successfully covers the multidimensionality of tourists with postmodern characteristics. This paper additionally adds knowledge to the existing body of literature on clustering methods and provides a rich attempt to overcome numerous problems involving classical segmentation techniques. Most segmentation approaches use a Likert scale to measure relevant variables, even though Dolnicar et al. (2018), in their paper discussing the handling of ordinal data, recommend against using them and D’Urso et al. (2021) suggest converting Likert values into fuzzy data before clustering. Sertkan et al. (2020b) further emphasize that participants oftentimes have difficulty rating their travel motivations, needs, and interests, a problem that can successfully be circumvented by utilizing representative photos to determine the tourist type they resonate with the most (H. Berger et al., 2007; Delic, 2016; Neidhardt et al., 2015; Sertkan et al., 2020a).
Keeping the latter in mind, the present study also made use of images for its self-assessment aspect; new in this context was the conversion of the images into descriptive text. By means of a cross-cultural sample, the complexity of semiotics could be taken into account (Arefieva et al., 2021) and the entirety of its information could be represented in document embeddings (Egger, 2022c). Moreover, owing to the advantage that a weighted semantic similarity could be determined between all formed vectors, a meaningful distance measure could be introduced. Another challenge of standard clustering approaches involves the number of clusters to be extracted and the subsequent evaluation of cluster solutions. This study was able to solve this problem by applying the Louvain community detection algorithm, which finds the most adequate cluster solution itself based on the best modularity partition (Celardo & Everett, 2020).
What is more, Dolnicar (2020) critically notes that too many variables are often used in clustering and therefore factor analysis should be performed in advance. This typically results in a massive loss of information since the dependencies between the individual variables and the distances between them vanish. In this proposed approach, however, factor analysis was performed with images only, and the actual clustering was conducted with user vectors. Thus, the factor analysis did not serve as a preprocessing for the clustering stage, but rather it facilitated more straightforward interpretations for after segmentation. Once clustering is complete, usually the sociodemographic, geographic, or behavioral characteristics of the extracted types are applied, meaning that the data used for clustering are not the same data used for data interpretation.
What could also be established is that segmentation solutions will have a fluid and amorphous character in the future, which is also why the individual clusters unveiled in this study were ultimately not given any identifying names (such as “the active generalists”)—as is generally the case. Unambiguous and universally valid groups as such will cease to exist; therefore, innovative methods and approaches that realign and adapt market segments with each additional data point are required (Dolnicar, 2002). This study’s approach merely requires new users to rate the images and re-run Louvain in order to generate new data points and continuously update segments. Additional collection of further data for characterizing the individuals is also recommended but certainly not mandatory.
On a final note, and from a managerial perspective, the presented methodology is of interest to all tourism service providers in need of a better understanding of their market(s) and hoping to change their marketing for the better. This approach thus allows the tourism industry to look at contemporary customers, and their multilayered complexity, from a holistic perspective. The step-by-step description of how to apply this study’s approach, especially via the provided explanatory video as well as the electronically available Supplemental Materials (the images used, the image vectors, and the Python codes), allows for the application of semiotic-semantic community detection. Additionally, while writing this paper, an online tool was developed to make the proposed approach accessible to small and medium-sized enterprises. This is important for several reasons: Tourism is an exceptionally fragmented industry, constituting the economic and societal backbone of many countries (Andreas, 2022). The fact that tourism is largely characterized by family-run SMEs operating in a global competitive environment calls for tools and assistance to be made accessible to this group as well—and not only to the big players. Especially in times of crisis—be it the effects of the COVID-19 pandemic or the Ukraine war and its associated economic impact—new IT-based approaches are all the more significant for remaining competitive (Gretzel et al., 2020). A customer-centric approach requires a fundamental understanding of customers, their wants, needs, and behaviors, and, in turn, a situational, motivational, and flexible analysis of the market.
The present study encountered several limitations. For example, the annotations of the images were made using a convenience sample, which could have resulted in an overrepresentation of younger participants and the interpretation of the images being biased. Another point of criticism could be that the individual perception of a photo during self-assessment did not coincide with the intersubjective annotation of the image, thus creating a distorted vector for that user. Although this problem was somewhat compensated for thanks to the relatively high number of evaluated photos, on the other hand, a great deal of effort was required to evaluate all 48 images.
Finally, it would be interesting to compare the results of this approach with other segmentation methods in order to truly confirm the postulated advantages. In a supplementary study, an attempt will be made to create destination vectors based on the textual descriptions of destinations. By applying similarity measures, user preferences could then be matched with destination characteristics, ultimately allowing the suggested method to be extended to a recommender system.
Supplemental Material
sj-zip-1-jtr-10.1177_00472875231183162 – Supplemental material for Vectorize Me! A Proposed Machine Learning Approach for Segmenting the Multi-optional Tourist
Supplemental material, sj-zip-1-jtr-10.1177_00472875231183162 for Vectorize Me! A Proposed Machine Learning Approach for Segmenting the Multi-optional Tourist by Roman Egger in Journal of Travel Research
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
