Deriving competitive intelligence from multifaceted user behavior data: An interpretable machine learning framework

Abstract

Competitive intelligence is essential for operations management decision-making. Beyond traditional offline information channels, firms increasingly gather online data and resources to generate comprehensive competitive intelligence. This study derives competitive intelligence in large markets by developing an interpretable machine learning framework that integrates multifaceted user behavior data, including user favorites, user-commented products, and user textual comments. Considering the complementary nature of these data sources, we first combine latent features derived from user favorites and user-commented products to improve submarket inference. Using these inferred submarkets as supervised signals, we connect user-commented products and associated textual comments to uncover consumer perceptions. We estimate the model using multifaceted data on online user behavior in the automotive domain. The results demonstrate that our model effectively improves submarket identification, captures consumer perceptions, and predicts competitive positions for new entrants. The derived competitive intelligence helps managers make more informed decisions in product operations and marketing strategies.

Keywords

Competitive Intelligence; User Favorites; Textual Comments; Market Structure; Machine Learning

1. Introduction

“If you know both yourself and the enemy, you will not be imperiled in a hundred battles.”

Sun Tzu, The Art of War

Firms are constantly looking for competitive intelligence to improve business operations (Kumar et al., 2020). Effective competitive intelligence enhances a firm's strategic position, but achieving this requires understanding the market from the perspectives of its competitors and consumers. Insights generated by consumers have proven valuable for providing competitive intelligence that helps businesses survive and succeed (DeSarbo et al., 2006; Netzer et al., 2012). To maximize the effectiveness of operations management strategies, this study integrates multiple types of online user behavior data, including user favorites, user-commented products, and textual comments, to offer managers a more complete picture of their competitive landscape.

The value of user-generated content (UGC) for competitive analysis is well established. Research has shown that online reviews and search logs contain rich information about consumer preferences and competitive dynamics (Bernstein et al., 2019; Chen et al., 2020). Previous UGC-based studies assumed that products that co-occurred in a single review or search session are potential competitors and applied clustering methods (e.g., K-means or community detection) to infer market structure (Netzer et al., 2012; Ringel and Skiera, 2016). Additionally, based on the unstructured text (e.g., review contents and social tags), several studies used Natural Language Processing (NLP) methods (e.g., rule-based text mining and latent Dirichlet allocation [LDA]) to analyze latent dimensions or aspects in which the target product and its competitors compete (Liu et al., 2021; Nam et al., 2017; Ye et al., 2022).

While these methods provide valuable and distinctive insights, several challenges remain. First, competitive information is typically scattered across multiple data sources. Individual data sources only provide partial insights. For example, online reviews rarely contain direct comparisons between competing products; a review is unlikely to explicitly compare a Tesla Model 3 with a BMW i4. This creates a sparsity problem in co-occurrence when inferring competitive relationships, particularly for niche products or new market entrants. Second, managers need clear competitor identification and a deep understanding of consumer perceptions, but existing approaches struggle to provide both simultaneously. Some researchers (Tirunillai and Tellis, 2014) have focused on extracting product attributes and perceptions from online reviews, but limited their analysis to predetermined sets of competing products. Others (Liu et al., 2019; Netzer et al., 2012; Ye et al., 2022) use a two-stage approach, first clustering products into competitive groups and then analyzing perceptions within those groups. However, this sequential approach means that errors in identifying competitive relationships can significantly impact the subsequent analysis of consumer perceptions. Third, competitive landscapes are constantly evolving as new products enter the market. While recent studies have made progress in analyzing market structure, predicting competitive positions for new entrants remains challenging. This is particularly true in markets with rapid product innovation and changing consumer preferences, where historical patterns may not fully capture emerging competitive dynamics.

To tackle these challenges, we develop an interpretable machine learning approach. This approach leverages the complementary nature of multifaceted user behavior data, including (1) user favorites, (2) user-commented products, and (3) user textual comments, as illustrated in Figure 1. Specifically, we combine user favorites and user-commented products to improve market-structure inference, as they capture different stages of the consumer decision process. In our context, user-commented products, which refer to products mentioned in user comment threads, represent early-stage exploration behaviors. These products indicate that users are actively gathering information or discussing options, reflecting a broader set of items that have attracted attention. In contrast, user favorites suggest that consumers may have narrowed their choice set and are closer to making a purchase decision. For example, in Figure 1, we observe that the user-commented products partially overlap with those in the user favorites (e.g., Audi A5, BMW 3) while also including additional options (e.g., BMW 5).¹ We posit that by integrating these complementary signals, we can mitigate potential biases or blind spots that may arise from relying on a single data source (Ouyang et al., 2018; Pant and Sheng, 2015).

Figure 1.

The interactions between users and products on the website.

To extract user perceptions of competitive products, we establish conditional connections between user-commented products and their corresponding textual comments. These connections allow us to analyze how consumers compare competing products and what attributes drive their evaluations. Specifically, the textual comments (i.e., labeled by the purple square in Figure 1) often contain rich information about specific product features and competitive factors. For example, in Figure 1, we observe that this user places significant emphasis on the configuration (e.g., tire size and brand) and appearance (e.g., front face and headlights) when considering rival cars to the Audi A5, BMW 3, and BMW 5. Furthermore, textual comments capture user sentiment, showing the factors influencing preferences and aversions toward a product.

To implement this integrated analysis, we propose the Hierarchically LDA (HLDA) model, which extends classical topic modeling approaches to incorporate multifaceted user behavior data. HLDA is built upon unsupervised topic models, namely LDA (Blei et al., 2003) and Correspondence LDA (Blei and Jordan, 2003). Specifically, to infer the market structure, we extend LDA by introducing a two-component mixture strategy to embed latent features and allocate user-commented products to distinct competitive submarkets. This extension uses latent features from favorites to improve submarket inference while also considering how products are discussed together in comments. Additionally, based on the latent features of current products and inferred submarkets, HLDA can predict competitive positions for new entrants. To further extract consumer perceptions, we extend Correspondence LDA to model the conditional relationship between submarkets derived from user-commented products and their associated textual comments. Through this extension, the model identifies latent topics from textual comments, which we interpret as competitive dimensions. These competitive dimensions capture the specific product attributes that consumers consider when comparing products within each submarket.

We apply the HLDA to a large-scale dataset with 324,641 favorite lists and 354,703 user-commented products. We evaluate the model through a set of focused analyses that demonstrate its ability to improve submarket identification, capture consumer perceptions, and predict competitive positions for new entrants. Additional validation through simulations, benchmark comparisons, and external data further supports the robustness of our approach. Overall, the results highlight the value of integrating multifaceted UGC signals for competitive intelligence.

This research makes the following contributions to the existing literature: (1) We explore the value of integrating multifaceted UGC information for competitive intelligence analysis. Existing literature has focused on extracting insights from single data sources, for example, online reviews to identify competing products (Netzer et al., 2012), search logs to map market structure (Ringel and Skiera, 2016), or textual comments to understand consumer perceptions (Tirunillai and Tellis, 2014). Our study demonstrates how combining different types of UGCs can provide a more complete picture of market competition. By analyzing the complementary relationships among user favorites (which reflect serious consideration), commented products (which capture broader attention), and textual comments (which provide context), we develop a more comprehensive understanding of product competitive relationships and consumer perceptions. (2) Our methodological contribution lies in developing an interpretable machine learning approach for analyzing multiple data types. From a technical perspective, we extend LDA to integrate user favorites with user-commented products, improving submarket inference. To extract consumer perceptions, we leverage Correspondence LDA by modeling the conditional relationship between user-commented products and their associated textual comments. The data integration strategies developed could serve as a template for operations management and marketing researchers seeking to analyze various complementary data sources.

2. Related work

Market structure analysis involves understanding the competitive relationships between products and how these relationships influence firm operations (Cooper and Inoue, 1996). The analysis can be constructed from either a supply or a demand perspective. The supply side determines whether the products are substitutes based on firm characteristics, strategy, performance, and managerial cognitions (Voleti et al., 2015), while the demand side assesses them through consumer perceptions (DeSarbo et al., 2006). In recent years, demand-side analyses have gained prominence as they directly reflect how consumers evaluate and choose between products in real decision-making scenarios.

The proliferation of online platforms has transformed how firms conduct market competition analysis. The Internet enables the timely analysis of market competition through UGCs. For example, early work has presented methods for extracting consideration sets to map market structure using online reviews (Netzer et al., 2012). However, such analyses can be challenging since it is rare for two competing entities to be cited in the same sentence fragments in online reviews (Valkanas et al., 2017). For instance, it is uncommon to see reviews stating, “Toyota is better than Honda.” To quantify this challenge, we collected consumer reviews on edmunds.com and autohome.com and found only 1.1% and 0.13%, respectively, of total reviews mentioned more than one car. This sparsity makes it difficult to reliably infer competitive relationships from reviews alone.

Rather than relying solely on textual reviews (e.g., Netzer et al., 2012) to extract competitive products, we combine user-commented products and favorite lists to analyze market structure. This integration helps overcome the data sparsity issues associated with online reviews. Subsequent studies have explored different types of UGCs, including consumer search (Ringel, 2023; Ringel and Skiera, 2016), purchase records (Gabel et al., 2019), brand-user networks (Yang et al., 2022), and user favorites (Liu et al., 2020), to analyze market competition. While these data sources provide valuable insights, they often capture different aspects of consumer decision-making. Our work advances this literature by examining how multiple UGC types can be integrated to provide complementary perspectives on competition, rather than treating each data source in isolation.

In addition, this study is related to the literature on topic models. Recent studies have used LDA-type models to extract latent topics from unstructured textual and multimodal data for applications such as identifying influential users (Igarashi et al., 2025), analyzing consumer engagement on social media (Liu et al., 2025), inferring purchase intentions (Li and Ma, 2020), and forecasting sales based on sentiment signals (Lau et al., 2018). Building on this stream of research, we extend LDA (Blei et al., 2003) and Correspondence LDA (Blei and Jordan, 2003) to address the requirements of market-structure analysis. First, we use an embedding method to extract the latent features from user favorites. Unlike traditional approaches that analyze only textual content, we apply embedding techniques to capture competitive relationships from behavioral data. Then, we incorporate latent features and user-commented products into LDA to enhance submarket inference. Second, we extend the Correspondence LDA to establish the conditional relationship between the submarkets hidden in user-commented products and user perceptions (i.e., topics) in their associated textual comments.

Due to the space limit, we only review the prior work most related to this study. In Appendix A, we first position our work in the literature along several dimensions and show how our research bridges the gaps in the literature. Then, we discuss an extensive list of studies about user favorites and textual comments.

3. The proposed model

This study proposes the HLDA model, outlined in Figure 2. In Figure 2, the shaded nodes are the observed variables. HLDA is implemented using a three-step unified framework to combine user favorites, user-commented products, and textual comments. In the first step, we apply an embedding method, Word2Vec (Mikolov et al., 2013), to extract the latent features from user favorites. These features serve as fixed inputs to subsequent steps. Products with similar latent features in the vector space tend to be competitors, as these similarities reflect how consumers group them during product evaluation. In the second step, we extend LDA to model the user-commented products by integrating the latent features learned from user favorites in the first step to identify submarkets. In the third step, we extend Correspondence LDA's approach to model the conditional relationships between submarkets inferred from user-commented products and the topics derived from associated textual comments. To provide a clear roadmap, we summarize the three types of data sources and their roles within the HLDA in Table 1. Next, we provide the details.

Figure 2.

Schema of the hierarchically latent Dirichlet allocation (HLDA) model.

Table 1.

Overview of data sources and their roles in hierarchically latent Dirichlet allocation (HLDA).

Data source	Description	Purpose in the model	Implementation context
User favorites	Lists of products favorited by the user	To capture product similarity based on user favorites	Used in Step 1 via Word2Vec to learn product latent features
User-commented products	The sequence/list of model-specific forums or products on which a user has posted comments	To identify submarket structures via product co-occurrence patterns	Integrated with Step 1 to infer submarket assignments
Textual comments	The text written by users for the commented products	To explain why products compete in a specific submarket	Integrated with Step 2 to infer user perceptions conditioned on submarkets

3.1 Step 1: Using Word2Vec to learn latent features from user favorites

To capture competitive relationships in user favorites, we use the Word2Vec method (Mikolov et al., 2013) to learn product latent features. Word2Vec was originally designed for NLP, where it analyzes word co-occurrence in small sliding windows within a set of documents. In our application, we apply this approach to user favorites, treating products as “words” and favorites lists as “documents.” When users frequently save certain products together (e.g., similar luxury electric vehicles (EVs)), this suggests these products share common attributes and may be competitors.
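The "products as words, favorites lists as documents" analogy can be made concrete with a minimal sketch. The code below does not train Word2Vec; it only enumerates the (target, context) product pairs that a skip-gram style trainer would see inside a sliding window. The function name `skipgram_pairs` and the two toy favorites lists are hypothetical illustrations, and the window of 5 is an assumption that mirrors the context size reported in Section 4.2.

```python
def skipgram_pairs(favorite_lists, window=5):
    """Enumerate (target, context) product pairs within a sliding window.

    Each favorites list plays the role of a document and each product
    the role of a word. This is a simplified illustration, not a trainer.
    """
    pairs = []
    for favorites in favorite_lists:
        for i, target in enumerate(favorites):
            lo = max(0, i - window)
            hi = min(len(favorites), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a product is not its own context
                    pairs.append((target, favorites[j]))
    return pairs


# Hypothetical toy favorites lists (not taken from the dataset).
favorite_lists = [
    ["Tesla_Model_Y", "BMW_iX3", "Audi_e-tron"],
    ["Audi_A5", "BMW_3"],
]
pairs = skipgram_pairs(favorite_lists, window=5)
```

Products that are saved together generate pairs, while products that never share a list do not, which is the co-occurrence signal the embedding is meant to compress into latent features.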

For each product $e_{i}$, we obtain its corresponding latent features $ϖ_{e_{i}}$ that represent its position in the competitive space. The latent features serve as inputs to our subsequent market-structure analysis, in which products with similar vectors are more likely to compete.² Figure 3 illustrates the product relationships using a two-dimensional Principal Component Analysis (PCA) projection based on these products’ latent features. The luxury EVs (i.e., Tesla_Model_Y, BMW_iX3, and Audi_e-tron) are clustered together, indicating they share similar latent features and, thus, are perceived as competitors in the market. Intuitively, the extracted latent features are valid predictors of competitive relationships.

Figure 3.

Examples of two-dimensional principal component analysis (PCA) projection of the latent features.

3.2 Step 2: Modeling user-commented products with latent features

We extend the LDA model to identify submarkets by analyzing user-commented products alongside latent features derived from user favorites. Following the LDA analogy, user-commented products serve as documents, individual products as words, and submarkets as topics. However, unlike standard LDA, we incorporate the latent features learned from user favorites in Step 1 to improve submarket inference. In HLDA's generative process, user-commented products are generated using a two-component mixture strategy (Nguyen et al., 2015). One component follows a categorical distribution similar to LDA, while the other component uses the softmax function to link the submarkets in the latent feature space.

Formally, let $U$ denote the number of users, with each user $u$ $(u = 1, 2, \dots, U)$ commenting on $N_{u}$ products, denoted as $e_{u} = \{e_{u 1}, e_{u 2}, \dots, e_{u N_{u}}\}$. Here, $e_{u n}$ denotes the $n$th product among user $u$'s commented products. Let $| E |$ denote the total number of unique products across all user comments.

Submarkets. We define $K$ submarkets, where each submarket $k$ has two components. First, a product distribution $ϕ_{k} = \{ϕ_{k 1}, ϕ_{k 2}, \dots, ϕ_{k | E |}\}$, an $| E |$-dimensional vector, captures the likelihood of each product appearing in submarket $k$. $ϕ_{k}$ represents the competitive relationships among products in user-commented products and, as in LDA, follows a Dirichlet distribution:

\begin{matrix} ϕ_{k} \sim \mathrm{Dirichlet} (β) \end{matrix}

(1)

where the hyperparameter $β$ controls the sparsity of the product distribution in submarkets. Intuitively, a higher value of $ϕ_{k e}$ indicates greater competitive strength of product $e$ in submarket $k$. For example, in a performance-oriented EV submarket, a high $ϕ_{k e}$ for a Tesla_Model_S suggests that it is a dominant competitor in this space.

Second, a submarket embedding $τ_{k}$ represents submarket $k$ as a vector in the same latent feature space as $ϖ_{e_{i}}$ learned from user favorites as in Section 3.1. $τ_{k}$ captures a latent pattern of product features and attributes that characterizes a specific competitive submarket. For example, in the EV market, one $τ_{k}$ might represent the luxury EV segment by emphasizing premium features, while another $τ_{k}$ might represent the affordable EV segment by emphasizing value-driven features. We use $τ_{k}$ to measure how well a product fits in a submarket based on its similarity patterns in user favorites.

This dual submarket representation integrates direct product co-occurrence patterns in user-commented products (i.e., $ϕ_{k}$) with similarity relationships derived from user favorites (i.e., $τ_{k}$). In our model, $ϕ_{k}$ and $τ_{k}$ serve as complementary components that jointly determine the market structure.

User-Commented Products. When generating user $u$'s commented products, we first sample this user's preference distribution $θ_{u} = \{θ_{u 1}, θ_{u 2}, \dots, θ_{u K}\}$ from the Dirichlet distribution with parameter $α$:

\begin{matrix} θ_{u} \sim \mathrm{Dirichlet} (α) \end{matrix}

(2)

where its element $θ_{u k}$ represents the probability that user $u$ would comment on products from submarket $k$.

In analyzing user comment patterns, we observe that consumers engage with products in two distinct ways: one anchored in users’ favorites, and another independent of favorites, in which product selection follows the unique co-occurrence patterns observed in user-commented products. To model this dual behavior, we use a two-component mixture strategy to generate each commented product $e_{u n}$ . Specifically, we introduce a binary indicator variable $s_{u n}$ (Rakesh et al., 2018; Zhuang et al., 2014) to determine from which component of a submarket ( $ϕ_{k}$ or $τ_{k}$ ) the product $e_{u n}$ is generated. The value of $s_{u n}$ is sampled from a Bernoulli distribution (Rakesh et al., 2018; Zhuang et al., 2014) with parameter $λ$ :

\begin{matrix} s_{u n} \sim \mathrm{Bernoulli} (λ) \end{matrix}

(3)

When $s_{u n} = 0$, we use the component of the submarket's product distribution $ϕ_{k}$ to generate this product (i.e., driven by comment-based product distributions, independent of user favorites), similar to LDA. Conversely, when $s_{u n} = 1$, we use the component of the submarket's embedding $τ_{k}$ to generate this product (i.e., driven by similarity to already-favored products), following Dieng et al. (2020). Mathematically,

\begin{aligned} e_{u n} \mid f_{u n}, s_{u n}, \{ϕ_{k}\}_{k = 1}^{K}, \{τ_{k}\}_{k = 1}^{K} \sim \begin{cases} \mathrm{Categorical} (ϕ_{f_{u n}}) & \text{when } s_{u n} = 0 \\ p (e_{u n} \mid f_{u n}, τ_{f_{u n}}, ϖ_{e_{u n}}) & \text{when } s_{u n} = 1 \end{cases} \end{aligned}

(4)

where $f_{u n} \in \{1, 2, \dots, K\}$ is the latent submarket assignment indicating which submarket the commented product $e_{u n}$ belongs to, following LDA with $f_{u n} \sim \mathrm{Categorical} (θ_{u})$, and $p (e_{u n} \mid f_{u n}, τ_{f_{u n}}, ϖ_{e_{u n}})$ is a softmax function used to project the latent features into a discrete probability distribution over the $| E |$ products. Similar to Dieng et al. (2020), this function is defined as:

\begin{matrix} p (e_{u n} \mid f_{u n}, τ_{f_{u n}}, ϖ_{e_{u n}}) = \mathrm{softmax} (τ_{f_{u n}} ϖ_{e_{u n}}) = \frac{\exp (τ_{f_{u n}} ϖ_{e_{u n}})}{\sum_{e^{'} \in E} \exp (τ_{f_{u n}} ϖ_{e^{'}})} \end{matrix}

(5)
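The generative step in Equations (3) to (5) can be illustrated with a short sketch. It is an illustration under stated assumptions rather than the authors' implementation: the two-dimensional feature vectors, the two submarkets, and the values of $θ_{u}$, $ϕ_{k}$, and $τ_{k}$ are all hypothetical toy numbers, $θ_{u}$ is held fixed instead of being drawn from the Dirichlet prior, and the function names are invented for this example. The mixture weight of 0.7 is the value reported in Section 4.2.

```python
import math
import random


def softmax_over_products(tau_k, product_features):
    """Eq. (5): distribution over products given a submarket embedding."""
    scores = {
        e: sum(t * w for t, w in zip(tau_k, vec))
        for e, vec in product_features.items()
    }
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {e: math.exp(s - m) for e, s in scores.items()}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def sample_product(rng, theta_u, phi, tau, product_features, lam):
    """Draw one commented product following Eqs. (3)-(5)."""
    f = rng.choices(range(len(theta_u)), weights=theta_u)[0]  # submarket
    s = 1 if rng.random() < lam else 0                        # Eq. (3)
    if s == 0:
        dist = phi[f]                                         # comment-based
    else:
        dist = softmax_over_products(tau[f], product_features)  # favorites-based
    products = list(dist)
    e = rng.choices(products, weights=[dist[p] for p in products])[0]
    return f, s, e


# Hypothetical toy inputs (assumptions for illustration only).
product_features = {
    "Tesla_Model_Y": [1.0, 0.0],
    "BMW_iX3": [0.9, 0.1],
    "Audi_A5": [0.0, 1.0],
}
tau = [[2.0, 0.0], [0.0, 2.0]]
phi = [
    {"Tesla_Model_Y": 0.5, "BMW_iX3": 0.4, "Audi_A5": 0.1},
    {"Tesla_Model_Y": 0.1, "BMW_iX3": 0.1, "Audi_A5": 0.8},
]
theta_u = [0.7, 0.3]

rng = random.Random(0)
probs = softmax_over_products(tau[0], product_features)
f, s, e = sample_product(rng, theta_u, phi, tau, product_features, lam=0.7)
```

In this toy setup the first submarket embedding points toward the first feature dimension, so the softmax places more mass on the two EVs than on the Audi A5.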

3.3 Step 3: Modeling user textual comments by considering submarket

To understand consumer perceptions of submarkets, HLDA analyzes textual comments and leverages the submarket assignments of the products in the comments as supervised signals. This approach is inspired by Correspondence LDA (Blei and Jordan, 2003), originally designed to model the conditional dependence between visual and textual topics. We apply this framework to establish the conditional relationship between submarkets and the topics discussed in the associated comments.

Suppose user $u$'s textual comments contain $M_{u}$ words, denoted by $w_{u} = \{w_{u 1}, w_{u 2}, \dots, w_{u M_{u}}\}$. Here, $w_{u m}$ denotes the $m$th word in user $u$'s comments. Let $V$ denote the vocabulary size.

Topics. Following Correspondence LDA's design principle, we build a one-to-one correspondence between the topics and submarkets, resulting in $K$ competition-related topics since there are $K$ submarkets. Each topic $k$ is defined as a $V$-dimensional vector $ψ_{k} = \{ψ_{k 1}, ψ_{k 2}, \dots, ψ_{k V}\}$ that captures the distribution of words associated with this topic. It follows a Dirichlet distribution:

\begin{matrix} ψ_{k} \sim \mathrm{Dirichlet} (β^{'}) \end{matrix}

(6)

where the hyperparameter $β^{'}$ controls the sparsity of the word distribution in competition-related topics.

Textual Comments. When generating words in comments, we condition on the submarkets of the products being discussed. For each word token $w_{u m}$, we follow the Correspondence LDA literature (Blei and Jordan, 2003) and sample the topic assignment $z_{u m}$ from the uniform distribution over the submarkets $\{f_{u 1}, f_{u 2}, \dots, f_{u N_{u}}\}$ associated with user $u$'s commented products. Then, we generate this word from the corresponding topic using the categorical distribution:

\begin{matrix} z_{u m} \sim \mathrm{Uniform} (f_{u 1}, f_{u 2}, \dots, f_{u N_{u}}), \quad w_{u m} \sim \mathrm{Categorical} (ψ_{z_{u m}}) \end{matrix}

(7)

For example, suppose a user comments on four products with submarket assignments $\{1, 1, 2, 3\}$. Because the draw is uniform over the four assignments rather than over the three distinct submarkets, the word token $w_{u m}$ is assigned topic 1 with probability 1/2 (= 1/4 + 1/4), topic 2 with probability 1/4, and topic 3 with probability 1/4.

This uniform distribution ensures that words are more likely to be associated with submarkets that dominate the user's commented-product set. By having the probabilistic link between products and words, our model captures how consumers articulate their perceptions about specific competitive submarkets, rather than treating textual comments as disconnected from the competitive context.
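The arithmetic behind this uniform draw can be checked with a few lines of code. The sketch below simply counts how often each submarket appears among a user's product-level assignments and divides by the number of commented products; the function name `topic_assignment_probs` is a hypothetical label for this illustration.

```python
from collections import Counter
from fractions import Fraction


def topic_assignment_probs(submarket_assignments):
    """Probability that a word token is assigned to each topic.

    A uniform draw over the user's product-level submarket assignments
    gives each submarket a weight equal to its share of the assignments.
    """
    counts = Counter(submarket_assignments)
    n = len(submarket_assignments)
    return {k: Fraction(c, n) for k, c in counts.items()}


# The four-product example from the text.
probs = topic_assignment_probs([1, 1, 2, 3])
```

Exact fractions are used so that the result can be compared directly with the probabilities 1/2, 1/4, and 1/4 stated in the example.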

3.4 Model inference

In Appendix B, we outline the generative process of HLDA. The model requires estimating four latent variables $\{f, z, s, τ\}$, representing submarket assignments, topic assignments, binary indicators, and submarket embeddings, respectively. Since the direct derivation of their posterior distributions is intractable, we develop an iterative inference procedure that combines Gibbs sampling with regularized maximum likelihood estimation.

Our inference algorithm proceeds as follows: in each iteration, we first fix the values of $τ$, then use the Gibbs sampling algorithm to sample the values of $f$, $z$, and $s$. We construct a Markov chain over the latent variables to collect independent samples for estimating the target posterior distribution. Subsequently, based on the sampled $f$, $z$, and $s$, we optimize $τ$ using regularized maximum likelihood estimation. Appendix B presents the mathematical derivations and detailed procedures for this inference process.

3.5 Deriving competitive intelligence

After running the Gibbs sampling algorithm until convergence, we estimate $ϕ_{k e}$, $ψ_{k v}$, and $θ_{u k}$ using the posterior expectation. This expectation provides optimal point estimates under the Bayesian framework by minimizing expected squared error. Additionally, using posterior expectations across multiple sampling iterations reduces random noise, leading to more stable and reliable estimates. From these parameters, we derive several key metrics for competitive intelligence.

3.5.1 Determining Competitive Strength

Let $Φ_{k e}$ be the probability of product e generated by submarket k. It signals the competitive strength (or leadership) of product e in submarket k. A higher value of $Φ_{k e}$ indicates that product e is more frequently considered and compared with other products in submarket k, and is estimated as:

\begin{matrix} Φ_{k e} = (1 - λ) ϕ_{k e} + λ \, \mathrm{softmax} (e \mid τ_{k} ϖ_{e}^{T}) \\ = (1 - λ) \frac{n_{k}^{e} + β}{\sum_{e^{'} = 1}^{| E |} (n_{k}^{e^{'}} + β)} + λ \, \mathrm{softmax} (e \mid τ_{k} ϖ_{e}^{T}) \end{matrix}

(8)

where $n_{k}^{e}$ denotes the number of times product $e$ is assigned to submarket $k$, and $λ$ is the contribution weight defined in Section 3.2. As shown in Equation (8), $Φ_{k e}$ includes two components. The first component captures product relationships from user-commented products, while the second reflects relationships based on user favorites. This dual-source approach mitigates potential biases from relying on either source alone, providing a more robust assessment of competitive relationships.
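A small numerical sketch may help show how the two components of Equation (8) combine. The assignment counts, feature vectors, and submarket embedding below are hypothetical toy values, and the function name `competitive_strength` is invented for this illustration; $β = 0.01$ and $λ = 0.7$ follow the settings reported in Section 4.2.

```python
import math


def competitive_strength(counts_k, tau_k, product_features, beta, lam):
    """Eq. (8): mix the smoothed count share with the embedding softmax."""
    products = list(product_features)

    # First component: Dirichlet-smoothed share of assignments to submarket k.
    denom = sum(counts_k.get(e, 0) + beta for e in products)

    # Second component: softmax of the dot product tau_k . w_e.
    scores = [
        sum(t * w for t, w in zip(tau_k, product_features[e]))
        for e in products
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)

    return {
        e: (1 - lam) * (counts_k.get(e, 0) + beta) / denom + lam * exps[i] / z
        for i, e in enumerate(products)
    }


# Hypothetical toy inputs (assumptions for illustration only).
product_features = {
    "Tesla_Model_S": [1.0, 0.0],
    "BMW_iX3": [0.8, 0.2],
    "Audi_A5": [0.0, 1.0],
}
counts_k = {"Tesla_Model_S": 30, "BMW_iX3": 15, "Audi_A5": 5}
tau_k = [2.0, 0.0]

strength = competitive_strength(
    counts_k, tau_k, product_features, beta=0.01, lam=0.7
)
leader = max(strength, key=strength.get)
```

Because each component is itself a probability distribution over products, the mixture also sums to one, so the values can be ranked directly within a submarket.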

3.5.2 Understanding user perceptions

Another aspect of competitive intelligence is understanding consumer perceptions within each submarket. By analyzing the posterior distribution of $ψ_{k}$, we can identify the key attributes and concerns that consumers associate with products in each submarket:

\begin{matrix} ψ_{k v} = \frac{N_{k}^{v} + β^{'}}{\sum_{v^{'} = 1}^{V} (N_{k}^{v^{'}} + β^{'})} \end{matrix}

(9)

where $N_{k}^{v}$ represents the number of times word $v$ is assigned to competition-related topic $k$.

To evaluate the sentiment (positive, negative, or neutral) associated with each topic, we use an explainable deep learning method, namely the sentiment lexicon-aware attention network (SLAN) (Wu et al., 2019). Unlike simple lexicon-based approaches, SLAN handles contextual information through attention mechanisms and properly accounts for negation and intensity modifiers. Implementation details are provided in Appendix C.

3.5.3 Analyzing user preferences over submarkets

We investigate the specific types of consumers who show interest in each submarket. Using our model, we can obtain each user's preference distribution over the submarkets by:

\begin{matrix} θ_{u k} = \frac{n_{u}^{k} + N_{u}^{k} + α}{\sum_{k^{'} = 1}^{K} (n_{u}^{k^{'}} + N_{u}^{k^{'}} + α)} \end{matrix}

(10)

where $n_{u}^{k}$ counts the number of products in user $u$'s commented products assigned to submarket $k$, and $N_{u}^{k}$ counts the number of words in user $u$'s textual comments related to submarket $k$. Note that the larger the value of $θ_{u k}$, the more actively user $u$ engages in discussions about submarket $k$.

3.5.4 Determining submarket popularity

The popularity of a submarket represents how frequently it is considered by consumers in the marketplace. This metric provides key information for managers to identify high-interest segments and monitor shifting consumer attention patterns. We define submarket popularity $ρ_{k}$ as:

\begin{matrix} ρ_{k} = \frac{\sum_{u} θ_{u k}}{\sum_{u} \sum_{k^{'}} θ_{u k^{'}}} \end{matrix}

(11)

This measure aggregates individual user preferences across the entire user base, providing a market-level view of submarket importance.
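Equations (10) and (11) can be traced on a toy example. The sketch below assumes two hypothetical users and three submarkets, with invented product and word counts and an illustrative $α = 0.5$; the function names are labels chosen for this example only.

```python
def user_preferences(product_counts, word_counts, alpha):
    """Eq. (10): smoothed share of a user's activity in each submarket."""
    raw = [n + N + alpha for n, N in zip(product_counts, word_counts)]
    total = sum(raw)
    return [r / total for r in raw]


def submarket_popularity(thetas):
    """Eq. (11): aggregate user preferences into submarket popularity."""
    num_submarkets = len(thetas[0])
    col_sums = [
        sum(theta[k] for theta in thetas) for k in range(num_submarkets)
    ]
    total = sum(col_sums)
    return [c / total for c in col_sums]


# Hypothetical counts for two users over three submarkets.
thetas = [
    user_preferences([2, 1, 0], [10, 5, 0], alpha=0.5),
    user_preferences([0, 0, 3], [0, 2, 20], alpha=0.5),
]
rho = submarket_popularity(thetas)
```

Since each $θ_{u}$ sums to one, the denominator of Equation (11) equals the number of users, so $ρ_{k}$ is simply the average of $θ_{u k}$ across users.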

4. Empirical application

4.1 Data description

We collect data from autohome.com.cn, the leading online transaction platform for automobiles in China. The platform consists of model-specific discussion forums, each corresponding to a particular car model (e.g., Audi A5). Within each forum, posts are organized by discussion threads. Users are not restricted to participating in a single forum. In fact, users frequently discuss and compare vehicles across model-specific forums, guided by their interests. As a result, a user's commented products may include vehicles from different categories (e.g., electric and gasoline-powered cars). In addition to participating in forums, the platform allows users to add vehicles of interest to their personal favorite lists. To construct user-level data, we develop a web-crawling program that starts from individual user profile pages (see Figure 1) and collects each user's favorited vehicles, commented vehicles, and the associated textual comments.

We perform several preprocessing operations, including filtering out users with minimal activity, standardizing product nomenclature, and using domain-specific word segmentation to enhance analytical precision. Appendix D reports descriptive statistics for user favorite lists, user-commented products, and the associated textual comments. The data contains 324,641 favorite lists and 354,703 user-commented products on 2330 car models from 434,258 unique users. The data spans from September 2017 to December 2021. On average, users comment on 4.3 cars with a standard deviation of 2.7, and the average size of a favorite list is 4.0 with a standard deviation of 2.3.

4.2 Market structure analysis

In this section, we present the results of the market structure analysis. We set the hyperparameters $β = β^{'}$ to 0.01 and $α$ to $50 / K$ , as suggested by Liu and Toubia (2018) and Griffiths and Steyvers (2004). The mixture weight λ is optimized using the Newton-Raphson method and is set to 0.7. To ensure convergence of the Gibbs sampler, we run 5000 iterations and discard the first 4500 as burn-in (see the convergence analysis in Appendix E). We then compute the posterior mean of each latent variable over the last 500 iterations. For the Word2Vec model, we set the context size to 5, given the short length of a typical user's favorite list in our data. The dimensionality of each product's latent feature vector is set to 200, and the number of training epochs is set to 15. To evaluate the influence of Word2Vec's hyperparameters on model performance, we also perform a sensitivity analysis in Appendix E. In addition, our model involves an important hyperparameter: the number of submarkets (i.e., the number of competition-related topics). We select it using a widely used metric, namely perplexity (Blei et al., 2003; Huang et al., 2018), and set $K = 60$ . Appendix F provides more details on this procedure.
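The burn-in step described above can be sketched as follows. This is only an illustration of averaging the retained draws; the Gibbs sampler itself is not shown, and the function name `posterior_mean` and the toy chain are hypothetical.

```python
def posterior_mean(draws, burn_in):
    """Average the draws retained after burn-in, element by element."""
    kept = draws[burn_in:]
    if not kept:
        raise ValueError("burn_in must be smaller than the number of draws")
    n_params = len(kept[0])
    return [sum(d[i] for d in kept) / len(kept) for i in range(n_params)]


# Toy chain (hypothetical): 10 draws of a 2-element parameter, first 6 discarded.
draws = [[float(t), 1.0] for t in range(10)]
estimate = posterior_mean(draws, burn_in=6)
```

With the settings reported in the text, the same call would use 5000 draws and `burn_in=4500`, so that the last 500 draws are averaged.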

To visualize the competitive market structure, we select the top 20 representative cars from each submarket based on their competitive strength determined by Equation (8). We then establish the connections between each submarket (represented by a red node) and its member cars (nodes in other colors). To visualize these connections, we use Gephi (Bastian et al., 2009) for graph and network analysis. Figure 4 showcases the global market network, revealing the competitive market structure. The 60 inferred submarkets contain 720 unique cars, with each car counted once even if it appears in several submarkets. Figure 4 also provides a clear view of product competition within a specific submarket.
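The network construction described above can be sketched as an edge list between submarket nodes and their top-ranked cars. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and apart from the two Model_3 strengths reported in Section 4.3, all values and car labels (`car_a`, `car_b`, `car_c`) are placeholders.

```python
def top_members(strengths, n=20):
    """Return the n cars with the highest competitive strength in one submarket."""
    ranked = sorted(strengths.items(), key=lambda item: item[1], reverse=True)
    return [car for car, _ in ranked[:n]]


def build_edges(strength_by_submarket, n=20):
    """Build (submarket, car) edges and the set of unique cars in the network."""
    edges = []
    for k, strengths in strength_by_submarket.items():
        edges.extend((f"submarket_{k}", car) for car in top_members(strengths, n))
    unique_cars = {car for _, car in edges}  # each car counted once
    return edges, unique_cars


# Toy input: only the Model_3 values come from the text; the rest are placeholders.
strength_by_submarket = {
    49: {"Model_3": 0.2866, "car_a": 0.05, "car_b": 0.02},
    21: {"Model_3": 0.1152, "car_c": 0.20, "car_a": 0.01},
}
edges, unique_cars = build_edges(strength_by_submarket, n=2)
```

An edge list of this form can be exported to a file and loaded into a graph tool for layout; a car that ranks in the top list of several submarkets contributes several edges but only one entry to `unique_cars`.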

Figure 4.

The visualization of the competitive market structure.

To demonstrate how managers can derive competitive intelligence from this network, we focus on Tesla's positioning in the rapidly evolving EV market. In submarket 49, three Tesla models (i.e., Model_3, Model_S, and Model_X) compete directly with vehicles from established luxury brands and emerging EV specialists like BYD and NIO. This submarket analysis reveals not only Tesla's current strong position but also growing competitive challenges that might be overlooked in traditional analysis frameworks.

The network structure in Figure 4 also reveals important insights through submarket adjacencies. Submarket 40 (containing premium European EVs like Audi e-tron and BMW iX3) and submarket 50 (dominated by emerging Chinese EV brands) share boundaries with submarket 49. These proximity relationships highlight potential cross-shopping behavior, where consumers evaluate options across market segments. For Tesla's strategic planning, understanding these adjacent market dynamics is crucial for maintaining a competitive advantage against established luxury manufacturers and new entrants.

Furthermore, our analysis shows that some Tesla models (e.g., Model_Y) appear in multiple submarkets, indicating increasingly fluid market boundaries. This finding suggests that consumers no longer restrict their attention to products within a single segment; instead, they evaluate options across different price tiers and categories. To further assess the generality of this pattern, we follow social network theory (Newman et al., 2002) and conduct a network degree analysis, in which the degree of a car is defined as the number of submarkets it simultaneously belongs to. Figure 5 reports the degree distribution of car models and shows that 44.3% of cars (319 out of 720) appear in two or more submarkets.
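The degree analysis above can be illustrated with a short sketch, in which a car's degree is the number of submarkets that list it as a member. The function name `degree_distribution` and the toy membership lists are hypothetical and do not reproduce the inferred submarkets.

```python
from collections import Counter


def degree_distribution(submarket_members):
    """Return per-car degrees and a histogram mapping degree to number of cars."""
    degree = Counter()
    for members in submarket_members.values():
        for car in set(members):  # guard against duplicates within a submarket
            degree[car] += 1
    return degree, Counter(degree.values())


# Toy membership lists (hypothetical).
submarket_members = {
    21: ["Model_3", "Audi_A4L", "BMW_3"],
    49: ["Model_3", "Model_S", "Model_X"],
    50: ["Model_Y", "NIO_ES6"],
    40: ["Model_Y", "Audi_e-tron"],
}
degree, distribution = degree_distribution(submarket_members)
multi = sum(1 for d in degree.values() if d >= 2)
share_multi = multi / len(degree)
```

In this toy example, two of eight cars have degree two or higher; the corresponding share reported for the full network is 319 of 720 cars.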

Figure 5.

Degree distributions of car models.

Although we classify the cars in our dataset into 60 fine-grained submarkets to provide a macro-level view of the market landscape, managers can also zoom in on individual submarkets depending on their specific questions and objectives. In Appendix G, we use vehicle size and price to map the relationships among submarkets, suggesting that the multi-granularity of market segmentation helps managers better understand the competitive landscape and make targeted decisions.

4.3 Rival products in submarkets

Figure 6 presents the competitive strength rankings of the top ten flagship vehicles in submarket 49 and its adjacent submarkets (i.e., 21, 40, and 50). This micro-level analysis provides several insights for Tesla's competitive strategy.

Figure 6.

Competitive strength of cars in four submarkets.

In submarket 49 (premium EVs), Model_3 maintains dominant market leadership, with a competitive strength of 0.2866, significantly higher than that of any competitor. This competitive strength quantifies Model_3's centrality in the submarket network, reflecting its greater consumer attention relative to other vehicles in the segment. However, this dominance does not transfer equally across submarkets. While Model_3 maintains a presence in the traditional luxury submarket 21 (where it competes with Audi_A4L and BMW_3), its competitive strength there is substantially lower (0.1152), indicating a more challenging competitive position. Similarly, Model_Y shows varying competitive strengths across submarkets 40 (European premium EVs) and 50 (emerging Chinese EVs). In submarket 40, Model_Y faces intense competition from established European luxury brands with a long heritage in premium segments. Conversely, in submarket 50, Model_Y competes against rapidly advancing Chinese EV manufacturers.

This variable performance across submarkets reveals an important strategic insight: even dominant brands like Tesla must develop submarket-specific competitive strategies rather than applying uniform approaches across segments. For instance, in submarket 21, Tesla might emphasize performance and total cost-of-ownership advantages over traditional luxury vehicles, while in submarket 50, the focus might shift to technological superiority and brand prestige to justify premium pricing relative to less-established competitors.

In Appendix H, we further examine submarket popularity based on Equation (11). Our analysis reveals a highly skewed distribution of consumer interest, with four dominant submarkets attracting over 4% of user attention each, while nearly 40% of submarkets remain relatively niche (popularity < 0.01). Specifically, within the EV sector, we find that emerging Chinese EV submarkets (e.g., submarket 50) have already surpassed some established European premium segments in popularity, suggesting a shift in competitive dynamics that may challenge traditional leaders like Tesla. Also, we identify a statistically significant but weak positive correlation (r = .29, p < .05) between submarket popularity and the number of competing products. This finding suggests that more popular submarkets tend to attract more competitors.

Table 2.

Topics and sentiment characteristics of the sampled submarkets.

4.4 User perceptions of submarkets

Beyond identifying competitive relationships, our model extracts user perceptions (Equation (9)) that reveal how customers evaluate products within each submarket. Table 2 presents the competition-related topics and associated sentiment analysis for submarket 49 and its adjacent submarkets. The sentiment words in Table 2 report the top 10 sentiment-related terms ranked by their frequency within each topic. The polarity score is calculated based on the relative proportions of positive and negative comments associated with each topic. This analysis reveals perception patterns across EV submarkets that provide actionable intelligence for product development and marketing.

Specifically, in submarket 49, technology and intelligence features dominate consumer discussion, with frequently mentioned terms such as “intelligent,” “automation,” and “advanced driver” in the associated topic. These discussions carry predominantly positive sentiment (polarity score = 0.428), with words such as “innovative” and “advanced” appearing frequently. This alignment between Tesla's technology-focused positioning and positive consumer perceptions is consistent with the effectiveness of Tesla's differentiation strategy in this segment.

Conversely, submarket 40 (European premium EVs) shows negative sentiment (−0.127) despite similar luxury positioning, with consumers expressing dissatisfaction with interface design and software performance, as reflected in topic terms like “unsatisfactory,” “outdated,” and “flawed.” This perception gap between submarkets 49 and 40 highlights a competitive advantage for Tesla in software experience and interface design, which could be further emphasized in marketing communications. In submarket 50 (emerging Chinese EVs), design aspects dominate consumer discussions, particularly concerning exterior styling and cabin innovation. The strongly positive sentiment (0.647) suggests that these brands successfully meet design expectations, potentially challenging Tesla's historical advantages in this dimension.

Across all three EV submarkets (40, 49, and 50), battery performance and driving range consistently emerge as key concerns, as indicated by terms such as “battery,” “range,” and “charging.” This recurring topic underscores the fundamental importance of driving range and charging infrastructure for consumer EV adoption, regardless of price point or brand positioning. These submarket-specific perception maps provide Tesla managers with precise guidance for product development prioritization and marketing message refinement. For example, maintaining technology leadership in automation and interface design appears critical in submarket 49, while addressing any perception gaps in exterior design may become increasingly important relative to competitors in submarket 50. Complementing these perception insights, Appendix I examines user preference patterns across submarkets.

Figure 7.

Two-dimensional principal component analysis (PCA) projection of the product vectors.

4.5 Prediction for new product

The Word2Vec component of our model enables the prediction of competitive positioning for new market entrants by leveraging their vector representations. Following Gabel et al. (2019), we estimate the vectors for new products in two steps. We first build label sets for all products, such as product type (e.g., SUV, MPV, and sedan), energy type (e.g., gas-powered vehicle, BEV, HEV, and PHEV³), and price range (e.g., $0–$20,000, $20,000–$30,000). Then, we use linear combinations of other product vectors to infer the vectors of new products. To illustrate this process, consider the Benz_EQC as a new market entrant. Its vector representation can be approximated by combining vectors of existing products that share similar attributes or represent specific feature transitions:

$$ ϖ_{\text{Benz\_EQC}} \approx ϖ_{\text{Audi\_e-tron}} - ϖ_{\text{Audi\_Q7}} + ϖ_{\text{Benz\_GLC}} $$

In this formulation, the difference $ϖ_{\text{Audi\_e-tron}} - ϖ_{\text{Audi\_Q7}}$ isolates the transition of a luxury brand from traditional gas-powered vehicles to EVs, so the luxury-oriented characteristics are retained. By adding $ϖ_{\text{Benz\_GLC}}$ , we integrate Benz-specific brand equity and its corresponding price-range characteristics into the new vector. As shown in Figure 7, the PCA projection indicates that the calculated vector for Benz_EQC is close to the true vector learned by Word2Vec, which supports the use of linear combinations to construct embeddings for new products.
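The linear combination above amounts to element-wise vector arithmetic. The sketch below uses three-dimensional toy vectors with hypothetical values; the model's product vectors have 200 dimensions and are learned by Word2Vec, and the function name `combine` is an illustrative assumption.

```python
def combine(base, minus, plus):
    """Approximate a new product vector as base - minus + plus, element-wise."""
    return [b - m + p for b, m, p in zip(base, minus, plus)]


# Hypothetical toy vectors standing in for learned product embeddings.
v_audi_e_tron = [1.0, 0.5, 0.25]
v_audi_q7 = [1.0, 0.5, -0.75]
v_benz_glc = [0.5, 1.0, -0.5]

# Estimated vector for the new entrant (Benz_EQC in the text's example).
v_benz_eqc = combine(v_audi_e_tron, v_audi_q7, v_benz_glc)
```

In this toy setup the first two coordinates cancel in the Audi difference, so only the third coordinate (standing in for the gas-to-EV shift) is transferred onto the Benz_GLC vector.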

To predict the competitive submarket of a new product, we calculate cosine similarity scores (SSs) between its vector and all submarket vectors $τ$ learned by our model. A higher similarity score indicates a greater likelihood that the new product will be assigned to this submarket. Due to the absence of user-new product interaction data (e.g., user-commented products), we cannot directly feed the new product vector into the model to predict its competitive strength. To overcome this challenge, we use the softmax function to compute the probability that the new product is associated with each submarket, similar to Equation (5). This predictive approach enables pre-launch competitive assessment, a significant advance over traditional methods that require market presence and consumer-interaction data to determine competitive positioning.
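A minimal sketch of this scoring step is given below. It assumes that the softmax is applied directly to the cosine similarity scores; the text does not spell out the exact input to the softmax, so this is an assumption. The function names, the submarket labels `k1` to `k3`, and the two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn scores into probabilities; subtract the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_submarkets(new_vector, submarket_vectors):
    """Return {label: (similarity score, probability)} for a new product."""
    labels = list(submarket_vectors)
    sims = [cosine_similarity(new_vector, submarket_vectors[k]) for k in labels]
    probs = softmax(sims)
    return {k: (s, p) for k, s, p in zip(labels, sims, probs)}


# Toy vectors (hypothetical) for one new product and three submarkets.
new_vector = [1.0, 0.0]
submarket_vectors = {"k1": [2.0, 0.0], "k2": [0.0, 3.0], "k3": [1.0, 1.0]}
result = predict_submarkets(new_vector, submarket_vectors)
best = max(result, key=lambda k: result[k][1])
```

Cosine similarity ignores vector length, so `k1` scores highest even though its vector is twice as long as the new product's vector.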

Figure 8 visualizes these SSs between new vehicles (released after 2022) and potential target submarkets. Based on this figure, we can identify the submarket assignment for each new car and determine which cars are likely to enter a specific submarket. For example, the NIO_ET5, positioned as a mid-to-high-end EV, has a high probability of being assigned to EV submarkets 40 (SS = 0.6246), 49 (SS = 0.7236), and 50 (SS = 0.7852). In submarket 49 (shown in a blue rectangle), we identify several potential entrants, including NIO_ET5 (SS = 0.7236), Xiaomi_S7 (SS = 0.5150), and BMW_i3 (SS = 0.4252). In response to the new entrants to submarket 49, Tesla's managers should implement a proactive strategy to maintain and strengthen the brand's competitive position.

Figure 8.

The heat map of the similarity scores between the new cars and submarkets.

Table 3 presents the predicted submarkets, competitive strengths, and ranking results for two new cars. We also list the top 5 representative cars for each predicted submarket. From the table, we observe that the BMW_i3 exhibits relatively low competitive strength (0.0068) in submarket 49, placing it near the bottom with a rank of 17. In contrast, the NIO_ET5 demonstrates significantly higher competitive strength (0.0433) in the same submarket, obtaining a more favorable rank of 7. This disparity in competitive strength and ranking indicates that the NIO_ET5 is perceived as a more prominent entrant in submarket 49. NIO_ET5 may exert a greater influence on consumer decision-making in this submarket, thereby attracting more attention and capturing a larger share of market demand. For Tesla's managers, it is crucial to monitor the market dynamics surrounding the NIO_ET5 and respond with tailored strategies to counter its influence in submarket 49.

These results quantify the potential competitive impact of new entrants, providing a measure of both their submarket membership probability and their projected competitive strength in those submarkets. To further understand how such competitive relationships evolve, we also developed a dynamic extension of the model. As detailed in Appendix J, the dynamic analysis tracks the evolution of EV submarkets from 2017 to 2021 and highlights the model's ability to capture not only the current competitive landscape but also the temporal dynamics of market competition. For example, while Tesla's luxury submarket was initially challenged by high-end gasoline vehicles (2017–2018), it later faced intense rivalry from domestic EV entrants like NIO and LiAuto (2019–2021). This evolution is accompanied by a clear transition in consumer discourse. Specifically, the focus has shifted from fundamental concerns like “battery endurance” and “price” toward new attributes such as “intelligent technology” and “resale value.”

4.6 Model validation

The empirical analyses in Sections 4.2 to 4.5 demonstrate that HLDA produces interpretable competitive structures, meaningful consumer perceptions, and reliable predictions for new market entrants. A natural question, however, is whether these insights stem from the principled integration of all three data sources, that is, user favorites, user-commented products, and textual comments, or whether a simpler, single-source model would yield comparable results. To address this, we conduct ablation studies that isolate the contribution of each component.

We construct five ablation variants by selectively removing or replacing components of HLDA:

HLDA-noLF: Removes latent feature learning from user favorites (i.e., $s_{u n} = 0$ ), relying only on user-commented products and textual comments to infer submarkets.

HLDA-noTC: Removes textual comments, combining only user-commented products and user favorites.

HLDA-noTC-SC: Removes textual comments and switches the roles of the two components, applying Word2Vec to user-commented products and LDA to user favorites.

HLDA-F (User Favorites Only): Removes textual comments and sets $s_{u n} = 1$ , using user favorites alone.

HLDA-C (User-Commented Products Only): Removes textual comments and sets $s_{u n} = 0$ , using user-commented products alone.

All variants share the same hyperparameters as the full model. For variants involving Word2Vec, we maintain a context window of 5, a latent dimension of 200, and 15 training epochs. We evaluate each variant using the coherence score (Mimno et al., 2011)⁴ and partial perplexity (Huang et al., 2018), selecting the optimal number of submarkets for each variant based on partial perplexity. To assess statistical significance, we conduct 10 independent runs per variant and perform paired t-tests against the full model. Results are summarized in Table 4, organized by signal complexity, from single-signal baselines to pairwise integrations to the full triadic model.
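The paired comparison described above can be sketched as follows. The snippet computes only the t statistic and degrees of freedom from per-run differences; the scores are hypothetical toy values, not the reported results, and the function name is an illustrative assumption.

```python
import math
import statistics


def paired_t_statistic(full_scores, variant_scores):
    """Return (mean difference, t statistic, degrees of freedom) for paired runs."""
    if len(full_scores) != len(variant_scores):
        raise ValueError("paired samples must have the same length")
    # The test is based on the per-run differences, not the two samples separately.
    diffs = [v - f for f, v in zip(full_scores, variant_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, t_stat, n - 1


# Hypothetical perplexities from 10 paired runs (lower is better).
full = [1600.0 + i for i in range(10)]
gaps = [60, 62, 58, 61, 59, 63, 57, 60, 61, 59]
variant = [f + g for f, g in zip(full, gaps)]
mean_diff, t_stat, dof = paired_t_statistic(full, variant)
```

With 10 runs the test has 9 degrees of freedom, for which the two-sided 5% critical value is about 2.262. In practice, `scipy.stats.ttest_rel` returns the p-value directly.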

Table 3.
The predicted submarkets, competitive strengths, and ranking results for two new cars.

New Car	Submarket prediction	Similarity score	Associated cars in the submarket	Competitive strength	Rank
BMW_i3	40	0.6905	Audi_e-tron, Benz_EQC, BMW_iX3, Volkswagen_ID3, Lexus_UX	0.0452	4
	50	0.5862	NIO_ES8, NIO_ES6, Xpeng_G3, Xpeng_P7, Tesla_Model_Y	0.0169	15
	49	0.4252	Tesla_Model_3, Tesla_Model_S, Tesla_Model_X, BYD_Han, NIO_ES6	0.0068	17
NIO_ET5	50	0.7722	NIO_ES8, NIO_ES6, Xpeng_G3, Xpeng_P7, Tesla_Model_Y	0.0832	3
	49	0.7236	Tesla_Model_3, Tesla_Model_S, Tesla_Model_X, BYD_Han, NIO_ES6	0.0433	7

Table 4.

Ablation analysis of multisource integration in HLDA.

Category	Model	Data sources	Coherence (Mean ± SD)	Improvement of coherence	Perplexity (Mean ± SD)	Improvement of perplexity
Single-signal	HLDA-F	User favorites	−519.29*** (±6.77)	5.30%	1682.91*** (±22.69)	3.87%
Single-signal	HLDA-C	User-commented products	−521.39*** (±7.52)	5.68%	1687.81*** (±19.52)	4.15%
Pairwise	HLDA-noLF	User-commented products + textual comments	−531.36*** (±7.44)	7.45%	1718.45*** (±20.58)	5.86%
	HLDA-noTC	User-commented products + user favorites	−506.71** (±7.17)	2.95%	1679.48*** (±22.43)	3.68%
	HLDA-noTC-SC	User-commented products + user favorites	−503.89** (±8.13)	2.41%	1675.48*** (±17.75)	3.45%
Full model	HLDA (Proposed)	User-commented products + user favorites + textual comments	−491.75 (±6.35)	-	1617.75 (±18.65)	-

Notes. All results are reported as mean ± standard deviation. The variance in the paired t-test is computed from the differences between the paired observations, and the number of paired observations is 10. Improvement reports the percentage by which the full HLDA model outperforms each variant. HLDA = hierarchically latent Dirichlet allocation; HLDA-F = HLDA of user favorites only; HLDA-C = HLDA of user-commented products only. *p < .05; **p < .01; ***p < .001.
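The Improvement columns in Table 4 are consistent with a relative difference taken with respect to each variant's own score. This formula is inferred from the reported numbers and is not stated explicitly in the notes; the function name and the `higher_is_better` flag are hypothetical.

```python
def improvement(variant, full, higher_is_better):
    """Percentage by which the full model improves on a variant's score."""
    gain = (full - variant) if higher_is_better else (variant - full)
    return gain / abs(variant) * 100


# Mean scores taken from Table 4.
coherence_gain_f = improvement(-519.29, -491.75, higher_is_better=True)
perplexity_gain_f = improvement(1682.91, 1617.75, higher_is_better=False)
coherence_gain_nolf = improvement(-531.36, -491.75, higher_is_better=True)
perplexity_gain_nolf = improvement(1718.45, 1617.75, higher_is_better=False)
```

Rounded to two decimals, these calls give 5.30 and 3.87 for HLDA-F and 7.45 and 5.86 for HLDA-noLF, matching the corresponding table entries.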

Table 4 shows that the full triadic model performs best across both evaluation metrics, showing that each source provides incremental information. The results also indicate that not all pairwise combinations are equally informative: pairwise models that retain favorite-derived latent features perform substantially better than the variant that removes them. Specifically, among single-signal baselines, HLDA-F and HLDA-C perform comparably, with coherence scores 5.30% and 5.68% below the full model, respectively. HLDA-F slightly outperforms HLDA-C, consistent with the view that favorites may reflect more serious consideration, while commented products may include exploratory mentions not representative of genuine competitive interest (Yang et al., 2015).

Among pairwise integrations, HLDA-noLF, which omits user favorites entirely, performs worst overall, with coherence and perplexity deteriorating by 7.45% and 5.86%, respectively. This result shows that the latent features learned from user favorites are the single most impactful component, as they provide a structured competitive embedding that neither commented products nor textual comments can replicate on their own. By contrast, HLDA-noTC and HLDA-noTC-SC, both of which retain user favorites but remove textual comments, show only moderate degradation (2.95% and 2.41% in coherence, respectively) and perform similarly. This suggests that rearranging how favorites and commented products are combined adds little value; what matters is their joint inclusion alongside textual comments.

The full HLDA model outperforms all variants (p < .01), showing that user favorites, user-commented products, and textual comments each supply complementary signals that jointly enable more precise submarket identification. User favorites anchor the competitive embedding; commented products capture broader co-occurrence patterns; and textual comments strengthen within-submarket topic coherence by surfacing the attributes that drive product comparisons. Removing any one source measurably degrades performance, and the full integration achieves the best coherence and perplexity among all models in Table 4.

Beyond these ablation studies, we conduct additional evaluations to further establish the robustness and predictive validity of our model. In Appendix K, we perform simulation-based parameter recovery checks using synthetic datasets with known ground truth, reporting recovery distributions for the sum of absolute errors and KL divergence. The results show that our estimation procedure reliably recovers the true latent parameters. Appendix L benchmarks HLDA against state-of-the-art baselines across coherence scores, silhouette coefficients, and perplexity. Our model outperforms these baselines, indicating that the gains from multi-signal integration are not simply a function of the Word2Vec embedding itself but arise from the full HLDA architecture. Finally, we conduct external validation against real-world data sources, including consumer reports, sales data, and human-rated benchmarks. The results show higher PMI scores, better alignment with observed sales patterns, and higher human-evaluation ratings than all baselines, providing evidence that the inferred submarkets reflect genuine competitive relationships rather than statistical artifacts.

5. Managerial implications

Competitive intelligence plays a pivotal role in informing strategic decision-making in operations management. Our integrated analysis offers insights that enhance operational decisions.

First, our study helps manufacturers and managers analyze market competition from both macro and micro perspectives. From the macro perspective, our approach enables managers to understand the overall competitive landscape. Visualizing submarkets and their relationships helps managers identify opportunities and threats not only in primary submarkets but also in adjacent segments. For example, our analysis reveals clear segmentation in the automotive market, with distinct but interconnected submarkets for traditional luxury vehicles, premium EVs, and emerging EV brands. From the micro perspective, manufacturers and managers can leverage our results to identify the competing products in each submarket and evaluate their respective competitive strengths. As demonstrated through the Tesla case in Section 4, its managers can detect both direct competitors and emerging threats that might be missed through traditional analysis.

Second, this study provides managerial support for product marketers and online platforms. Specifically, our model is designed to obtain competitive intelligence by analyzing latent patterns of individual user behavior (i.e., favorite and comment behaviors). Compared with prior approaches (Matthe et al., 2023; Ringel, 2023; Ringel and Skiera, 2016), our model additionally provides insights into user perceptions across submarkets. Product marketers can leverage user perceptions to identify the latent aspects users prioritize most, then create a more compelling value proposition that resonates with users and elicits positive responses. From the derived user perceptions, product marketers can also proactively address negative sentiments and take corrective actions to maintain a positive product image. Finally, online platforms can leverage user perceptions to provide recommendations. For example, the platform can generate interpretable rationales by highlighting which perceived attributes contribute most to a given recommendation, thereby increasing user trust and engagement.

6. Conclusion

This study proposes an interpretable machine learning framework for deriving competitive intelligence from multifaceted user behavior data. By jointly inferring market structure and consumer perceptions, HLDA integrates three complementary signals, including latent features from user favorites, co-occurrence patterns from user-commented products, and attribute-level discourse in textual comments. Applied to a large-scale automotive dataset, the framework successfully identifies fine-grained competitive submarkets, characterizes rival product strengths, extracts consumer perceptions, and predicts competitive positioning for new entrants. Ablation studies and external validation show that the triadic integration performs best in the reported validation exercises and that the inferred submarkets align with real-world competitive relationships, as evidenced by sales patterns and human-rated benchmarks.

This work contributes to the operations management and marketing literature in two ways. First, it demonstrates the value of integrating multiple UGC types rather than treating each in isolation for competitive analysis. Second, it provides an interpretable methodological template that researchers and practitioners can adapt to other platforms and product categories where complementary behavioral signals exist.

Several directions remain open. First, the current framework treats all commenting behavior symmetrically. Future work could incorporate sentiment directly into the estimation process to distinguish favorable evaluation from critical comparison, which may sharpen submarket boundaries and perception maps. Second, the bag-of-words assumption limits the model's ability to link individual products to their specific associated textual comments. Future work could address this limitation by incorporating alignment mechanisms, such as attention-based models or contextual language representations, to capture semantic dependencies between products and their corresponding comments. More broadly, examining how our framework performs across platforms with different engagement structures, or in markets with faster product cycles, would further establish the generality of the multi-signal integration approach.

Supplemental Material

Supplemental material, sj-pdf-1-pao-10.1177_10591478261457235 for Deriving competitive intelligence from multifaceted user behavior data: An interpretable machine learning framework by Yang Qian, Hai Che, Yezheng Liu, Yuanchun Jiang and Jennifer Shang in Production and Operations Management

Footnotes

Acknowledgments

The authors sincerely thank Editor-in-Chief Subodha Kumar, Departmental Editor Fred Feinberg, Senior Editor P. K. Kannan, and the anonymous reviewers for their valuable and insightful suggestions.

Funding

The authors received the following financial support for the research, authorship, and/or publication of this article: Yang Qian, Yezheng Liu and Yuanchun Jiang are supported by the National Natural Science Foundation of China (72101072, 72342011, 72571096, 72271084, 72171071), and the Key Laboratory of Philosophy and Social Sciences for Cyberspace Behaviour and Management.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online (doi: ).

Notes

How to cite this article

Qian Y, Che H, Liu Y, Jiang Y and Shang J (2026) Deriving competitive intelligence from multifaceted user behavior data: An interpretable machine learning framework. Production and Operations Management XX(XX): 1–17.

References

Bastian

Heymann

Jacomy

(2009) Gephi: An open source software for exploring and manipulating networks. Proceedings of the International AAAI Conference on web and Social media 3: 361–362.

Bernstein

Modaresi

Sauré

(2019) A dynamic clustering approach to data-driven assortment personalization. Management Science 65(5): 2095–2115.

Blei

Jordan

(2003) Modeling annotated data. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 127–134.

Blei

Jordan

(2003) Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan): 993–1022.

Chen

Yin

, et al. (2020) Try this instead: Personalized and interpretable substitute recommendation. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval: 891–900. doi:10.1145/3397271.3401042.

Cooper

Inoue

(1996) Building market structures from consumer preferences. Journal of Marketing Research 33(3): 293–306.

DeSarbo

Grewal

Wind

(2006) Who competes with whom? A demand-based perspective for identifying and representing asymmetric competition. Strategic Management Journal 27(2): 101–129.

Dieng

Ruiz

Blei

(2020) Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8: 439–453.

Gabel

Guhl

Klapper

(2019) P2V-MAP: Mapping market structures for large retail assortments. Journal of Marketing Research 56(4): 557–580.

10.

Griffiths

Steyvers

(2004) Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1): 5228–5235.

11.

Huang

Lehavy

Zang

, et al. (2018) Analyst information discovery and interpretation roles: A topic modeling approach. Management Science 64(6): 2833–2855.

12.

Igarashi

Zhang

Kannan

, et al. (2025) Identifying influential users by topic in unstructured user-generated content. Production and Operations Management 34(10): 3267–3288.

13.

Kumar

Saboo

Agarwal

, et al. (2020) Generating competitive intelligence with limited information: A case of the multimedia industry. Production and Operations Management 29(1): 192–213.

14.

Lau

RYK

Zhang

(2018) Parallel aspect-oriented sentiment analysis for sales forecasting with big data. Production and Operations Management 27(10): 1775–1794.

15.

(2020) Charting the path to purchase using topic models. Journal of Marketing Research 57(6): 1019–1036.

16.

Liu

Toubia

(2018) A semantic approach for estimating consumer content preferences from online search queries. Marketing Science 37(6): 930–952.

17.

Liu

Wang

Fang

, et al. (2025) The impact of verbal and visual content on consumer engagement in social Media marketing. Production and Operations Management 34(11): 3416–3437.

18.

Liu

Jiang

Zhao

(2019) Assessing product competitive advantages from the perspective of customers by mining user-generated content on social media. Decision Support Systems 123: 113079.

19. Liu, Qian, Jiang, et al. (2020) Using favorite data to analyze asymmetric competition: Machine learning models. European Journal of Operational Research 287(2): 600–615.

20. Liu, Qin C-X, Zhang Y-J (2021) Mining product competitiveness by fusing multisource online information. Decision Support Systems 143: 113477.

21. Matthe, Ringel, Skiera (2023) Mapping market structure evolution. Marketing Science 42(3): 589–613.

22. Mikolov, Sutskever, Chen, et al. (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26: 3111–3119.

23. Mimno, Wallach, Talley, et al. (2011) Optimizing semantic coherence in topic models. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics: 262–272.

24. Nam, Joshi, Kannan (2017) Harvesting brand information from social tags. Journal of Marketing 81(4): 88–108.

25. Netzer, Feldman, Goldenberg, et al. (2012) Mine your own business: Market-structure surveillance through text mining. Marketing Science 31(3): 521–543.

26. Newman, Watts, Strogatz (2002) Random graph models of social networks. Proceedings of the National Academy of Sciences 99(suppl 1): 2566–2572.

27. Nguyen, Billingsley, et al. (2015) Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics 3: 299–313.

28. Ouyang, Guo, et al. (2018) Competitivebike: Competitive analysis and popularity prediction of bike-sharing apps using multi-source data. IEEE Transactions on Mobile Computing 18(8): 1760–1773.

29. Pant, Sheng (2015) Web footprints of firms: Using online isomorphism for competitor identification. Information Systems Research 26(1): 188–209.

30. Rakesh, Ding, Ahuja, et al. (2018) A sparse topic model for extracting aspect-specific summaries from online reviews. Proceedings of the 2018 World Wide Web Conference: 1573–1582. doi:10.1145/3178876.3186069.

31. Ringel (2023) Multimarket membership mapping. Journal of Marketing Research 60(2): 237–262.

32. Ringel, Skiera (2016) Visualizing asymmetric competition among more than 1,000 products using big search data. Marketing Science 35(3): 511–534.

33. Tirunillai, Tellis (2014) Mining marketing meaning from online chatter: Strategic brand analysis of big data using latent Dirichlet allocation. Journal of Marketing Research 51(4): 463–479.

34. Valkanas, Lappas, Gunopulos (2017) Mining competitors from large unstructured datasets. IEEE Transactions on Knowledge and Data Engineering 29(9): 1971–1984.

35. Voleti, Kopalle, Ghosh (2015) An interproduct competition model incorporating branding hierarchy and product similarities using store-level data. Management Science 61(11): 2720–2738.

36. Liu, et al. (2019) Sentiment lexicon enhanced neural sentiment classification. Proceedings of the 28th ACM International Conference on Information and Knowledge Management: 1091–1100. doi:10.1145/3357384.3357973.

37. Yang, Toubia, De Jong (2015) A bounded rationality model of information search and choice in preference measurement. Journal of Marketing Research 52(2): 166–183.

38. Yang, Zhang, Kannan (2022) Identifying market structure: A deep network representation learning of social engagement. Journal of Marketing 86(4): 37–56.

39. Xia, Zhang, et al. (2022) Harvesting online reviews to identify the competitor set in a service business: Evidence from the hotel industry. Journal of Service Research 25(2): 301–327.

40. Zhuang, Gao, et al. (2014) Probabilistic word selection via topic modeling. IEEE Transactions on Knowledge and Data Engineering 27(6): 1643–1655.
