Factoring textual reviews into user preferences in multi-criteria based content boosted hybrid filtering (MCCBHF) recommendation system

Abstract

Recommendation systems help customers find interesting and valuable resources in internet services. Their priority is to create and examine users’ individual profiles, which contain their preferences, and then update the profile content with additional features to increase user satisfaction. Specific characteristics, descriptions and reviews of the items to recommend also play a significant part in identifying preferences. However, inferring a user’s interests from his activities is a challenging task, and it is crucial to identify those interests without the user’s intervention. This work elucidates the effectiveness of textual content together with metadata and explicit ratings in boosting collaborative techniques. To infer users’ preferences, metadata content information is boosted with user features and item features extracted from the text reviews using sentiment analysis with the Vader lexicon-based approach. Before sentiment analysis, ironic and sarcastic reviews are removed for better performance, since those reviews invert the polarity of sentiments. The Amazon product dataset is used for the analysis. From the text reviews, we identify the reasons that would have led the user to the overall rating given by him, referred to as features of interest (FoI). FoI are formulated as multiple criteria, and the ratings for the multiple criteria are computed from the single rating given by the user. Multi-Criteria-based Content Boosted Hybrid Filtering (MCCBHF) techniques are devised to analyze user preferences from their review texts and ratings. This technique is used to enhance various collaborative filtering methods, and the proposed MCKNN, MCEMF, MCTFM and MCFM techniques provide better personalized product recommendations to users. Among the proposed MCCBHF algorithms, MCFM yields the best result, with the lowest RMSE value of 1.03.

Keywords

Multi-criteria; Content Boosted Hybrid filtering; product recommendation; user preferences; text reviews; collaborative filtering

1. Introduction

Online shopping is ubiquitous, but online stores, while eminently searchable, lack the browsing options of the brick-and-mortar variety. Visiting a bookstore in person, a customer can wander over to the science fiction section and casually look around without a particular author or title in mind. Online stores often offer a browsing option, and even allow browsing by genre, but the number of options available is usually still overwhelming [1]. Commercial sites try to counteract this overload by showing special deals, new options and favorites, but the best marketing angle would be to recommend items that the user is likely to enjoy or need. Recommendation systems (RS) mainly focus on accuracy, scalability and cold-start problems. Apart from these, explainability and transparency also play essential roles in increasing the user’s trust and satisfaction.

Recently Latent Factor Models (LFM) and deep learning models in collaborative filtering have demonstrated good prediction accuracy. Deep learning models are essentially black boxes, where the developer has no control over the features or aspects extracted from the given input [7]. Most of the LFM models use user-item ratings alone for making the recommendation. Text reviews contain a lot of information that gives us details about user preferences. By analyzing the text reviews, explanations can be provided for recommendations. Using sentiment analysis, features in which the user would be interested while purchasing a product and the features for which a purchased item is rated well are extracted from the text reviews [16].

We aim to provide a hybrid product recommendation system that infers the preferences of users from their text reviews in the form of Features of Interest (FoI). Important features of interest are formed as multiple criteria (MC), and the ratings for the multiple criteria are inferred from the single user-item rating using a Multi-Criteria Decision Making (MCDM) technique. The inferred multi-criteria ratings and metadata are used to find similar users in collaborative filtering techniques. The proposed Multi-Criteria based Content Boosted Hybrid Filtering (MCCBHF) techniques predict recommendations based on the single ratings, metadata features, user features, item features and inferred MC ratings. Collaborative Filtering (CF) methods such as the Tensor Factorization Model (TFM), Matrix Factorization (MF), Factorization Machine (FM) and K-Nearest Neighbour (KNN) are boosted with metadata and inferred multi-criteria ratings to form multi-criteria based KNN (MCKNN), multi-criteria based TFM (MCTFM), multi-criteria based explicit matrix factorization (MCEMF) and multi-criteria based FM (MCFM). The system also provides explanations for the recommendations (FoI), which increases user satisfaction when purchasing products.

This paper aims to infer the preferences of the user from the single rating given by the user, with the help of the interests extracted from the user’s textual reviews. The user interests are formulated as multiple rating criteria, and the multiple ratings are inferred from the single rating given by the user. The rest of the paper is organized as follows. Section 2 describes work related to recommendation systems. The concepts of multi-criteria recommendation systems, explainable RS and the need for irony detection are discussed in Section 3. Section 4 explains the proposed system for product recommendation. Results are discussed in Section 5. The conclusion in Section 6 gives pointers to possible future enhancements.

2. Related work

The importance of textual reviews in recommendation systems, which contain both user-oriented information (user opinions) and product-oriented information (product features), is stressed in [33]. Zhang attempts to incorporate textual reviews over the Yelp dataset to tackle cold-start problems, the explanation of recommendations and the automatic generation of user or item profiles. A phrase-level sentiment analysis framework over a textual review corpus is proposed in [34], where the authors generate a sentiment lexicon for personalized recommendation. They design an Explicit Factor Model to give feature-level explanations for both recommended and non-recommended items along with the generated personalized recommendation.

Two novel methods, a word-based method and an LDA-based method, are used for discovering contextual information quickly and efficiently across different applications from user-generated reviews [4]. The authors of [14] introduced a scalable and practical lexicon-based approach that performs well in terms of speed and accuracy for extracting sentiments using emoticons and hashtags.

Topic Matrix Factorization Model is proposed in [2] that combines the idea of Matrix Factorization (MF) for rating prediction and Non-Negative Matrix Factorization Model (NMF) for uncovering latent topic factors in review texts. These two tasks are related by designing the A-T (Addition Transform) or M-T (Multiplication Transform) functions to align the topic distribution parameters with the corresponding latent user and item factors.

Non-linear relationships among users and items are captured using a neural network-based recommendation model (NeuRec) in [32]. It uses a neural network model to untangle the complexity of user-item interactions and their latent factors. Even though the overall I-NeuRec and U-NeuRec models perform well, the NeuRec model with a pairwise learning approach performs worse in terms of ranking quality. This pairwise learning approach is used to maximize the difference between positive and negative items.

Various features such as heuristic terms, classification terms, heuristic aspects and hierarchy aspects are extracted from users’ reviews and used to find similarity in the k-NN algorithm by [8]. However, the metadata features of the item are not used for making recommendations.

The authors of [19] use the LDA method to learn latent review topics and LFM to learn latent rating dimensions. In [27], a sentiment dictionary is used to predict the sentiment score from reviews and Probabilistic Matrix Factorization (PMF) is used to predict the rating. The DeepCoNN model uses two parallel Convolutional Neural Networks (CNNs) to learn user features and item features, and a shared layer to learn the interactions between them [35]. In DeepCoNN, reviews are first processed by the two CNNs to learn the user’s and item’s representations, which are then concatenated and passed into a regression layer for rating prediction. A limitation of DeepCoNN is that it requires reviews in the testing phase, which are not available in most cases.

An aspect-aware LFM and an aspect-aware topic model that learn the latent factors and latent topics for making recommendations are proposed in [6]. The performance of DeepCoNN decreases significantly when reviews are unavailable in the testing phase. TransNet applies neural networks, which have exhibited strong capabilities in representation learning, to reviews in order to learn users’ preferences and items’ characteristics for rating prediction. However, it may suffer from noisy information in reviews, which deteriorates performance, and from errors introduced when generating fake reviews for rating prediction, which also bias the final performance, as mentioned in [6].

The Neural Attentional Regression model with Review-level Explanations (NARRE) is discussed in [5]; it predicts the ratings and also the usefulness of the reviews in providing a better recommendation. A global and local tensor factorization model is used by [31] to jointly learn a global predictive model and multiple local predictive models, which capture the overall rating behaviour and the diverse rating behaviours of users, respectively. In order to do this, they use the explicit predefined multi-criteria ratings already obtained from the users in addition to the single overall rating.

While metadata information plays a vital role in identifying user preferences, much of the work surveyed in this section does not take advantage of it. Explicit and implicit feedback along with the metadata can be explored and exploited to infer users’ interests. Recommending items based on user’s interest helps increase user satisfaction and thus increase profit and sales. Also, most of the recommendation systems obtain a single explicit rating from the user, yet we need ratings for various criteria to understand the preferences better. Hence there is a need to infer multi-criteria ratings from the single user-item rating. Different collaborative filtering techniques can be boosted with the content features from text reviews and metadata, and their performance can be analyzed for providing better personalized recommendations.

3. Methodologies used in the proposed approach

3.1. Multi-criteria recommendation system

A user always gives a single rating for each item. Yet, his rating may be based on only some features which are of interest to him. Another user may have assigned the same rating to the item, but the features he is interested in may be different. Therefore, ratings alone or metadata about the items alone are not adequate to know the interest of the user so as to provide a personalized recommendation [31]. However, we can infer the interest of the users from implicit feedback such as pages visited, click rates or time spent on an item’s page. Moreover, the explicit text reviews given by the user are a potential source from where we can infer his interest [12]. Although reviews are sparser than ratings, they provide more detailed and reliable information about the user’s preferences and interests. Personalization can be improved by adding the review text information in addition to ratings and metadata. For example, in a text review of a cellphone, if a user talks about the battery life and charging time, he is more interested in the battery than other qualities of the cell phone such as display, camera, style, and cost. Thus, we identified user preferences from the ratings and text reviews.

Using sentiment analysis on the text reviews given by the user, we identified the Features of Interest (FoI), i.e., the features of the product the user is interested in. We then form the User-Feature Correlation Matrix (UFCM) that, for each user, assigns a weight to each feature of the item. Similarly, the text reviews of each item are analyzed to find out the degree to which each feature is liked or disliked by the overall user community. The Item-Feature Correlation Matrix (IFCM) thus formed contains the positive or negative sentiment for the features of each item. Ratings for multiple criteria are computed from the single overall rating given by the user. This multi-criteria rating matrix, when used to provide a recommendation, finds similar users of the same interest more effectively than a single-criterion rating matrix. A user who concentrates more on battery life will have as similar users those who consider battery life more important than other features.

Table 1
Multi-criteria rating matrix

Features (criteria order): Display resolution, Camera quality, Battery life, Cost
Items: Item1, Item2, Item3, Item4, Item5

User	Overall rating (criteria ratings) for each rated item
User1	5 (2,2,8,8); 7 (5,5,9,9)
User2	5 (8,8,2,2)
User3	6 (3,3,9,9)
User4	6 (4,4,8,8); 9 (8,8,10,10); 3 (1,3,4,0)

Table 1 shows the multi-criteria ratings given by four users for five items. Features of cellphones such as display resolution, camera quality, battery life and cost are taken into consideration. Users 1 and 2 both gave a rating of 5 to item 1. Here user 1 mainly concentrates on battery and cost, while user 2 concentrates on display and camera quality. Even though both users gave the same rating, their interests differ, so they are not considered similar users or neighbours. Hence item 3 will not be the correct recommendation for user 2. Multi-criteria decision making is used to analyze and split the given single user-item rating among the various criteria that the user likes. These inferred multi-criteria ratings and the product metadata features are used in Multi-Criteria based Content Boosted Hybrid Filtering (MCCBHF) to achieve better personalized, explainable recommendations for the user.
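The contrast between users 1 and 2 can be checked numerically. The following is a minimal sketch using the criteria ratings of item 1 from Table 1; the choice of cosine similarity and Pearson correlation is an assumption made for illustration, since the text does not state which similarity measure is used to form neighbourhoods.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation, computed as the cosine of mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

# Criteria order: display resolution, camera quality, battery life, cost.
user1_item1 = [2, 2, 8, 8]  # overall rating 5 (Table 1)
user2_item1 = [8, 8, 2, 2]  # overall rating 5 (Table 1)

cos_sim = cosine(user1_item1, user2_item1)   # 64 / 136, about 0.47
corr = pearson(user1_item1, user2_item1)     # -1.0: opposite preferences
print(cos_sim, corr)
```

On the overall rating alone the two users look identical (both gave 5), whereas the criteria vectors are perfectly negatively correlated, which is the mismatch the table is meant to illustrate.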

3.2. Explainable recommendation systems

A hybrid approach is proposed to give personalized recommendations with explanations to users. Instead of simply giving recommendations, the system gives the user items preferred by him or suggested by similar users. This increases the satisfaction of the user and the effectiveness of the system. The text reviews are analyzed for sarcastic and ironic content, and those reviews are removed from further processing as they would affect the recommendation adversely. Then the features that are of interest to the user while purchasing a particular product are extracted by analyzing their text reviews. From the text reviews, a sentiment lexicon is constructed using Natural Language Processing (NLP) by analyzing users’ sentiments on the extracted features for a personalized recommendation. The features are extracted as triplets $(F, O, S)$ , where F is a feature of the product, O an opinion word, and S the polarity of sentiment (positive, negative or neutral). Then the user-feature and item-feature correlations are identified from them. Lastly, Multi-Criteria based Content Boosted Hybrid Filtering (MCCBHF) techniques are used to integrate the item features, users’ ratings and reviews, and to generate an explainable recommendation that optimizes trust among users.

3.3. Ironic and sarcastic content analysis

Humans have a natural ability to identify the sentiment or the irony intent of reviews or comments. This ability starts from childhood and develops as we grow and interact with others. However, for the machine, identifying the intention of a user is a rather difficult task. Sentiment analysis plays a vital role in e-Commerce. In order to increase the accuracy of sentiment analysis, identifying ironic and sarcastic content in the text is necessary [3]. This information plays a vital role in inferring the actual intention of the user.

It is very important to identify ironic content in a given text, since it inverts the polarity of the inferred sentiment [10,12]. Hence automatic detection of ironic and sarcastic content is necessary. The recent growth of the Internet and the vast amount of data available on it have made this detection possible. The Internet influences our daily life: we rely on Internet data for buying items, watching movies and searching for information. Social media comments and reviews also play a major role in e-commerce. People read the reviews of a particular product before deciding to purchase it. Currently, analyzing the sentiment of reviews is an active area of research.

Ironic content in a text affects the polarity of the inferred sentiment: it conveys the opposite of its literal meaning. Therefore it can be called a polarity reverser [18]. Irony is studied by various disciplines such as linguistics, philosophy and psychology. The frequent use of ironic text in social media makes irony detection important for natural language processing tasks, but achieving high performance remains difficult [17,30]. The potential applications of irony detection include text mining, author profiling, detecting online harassment and sentiment analysis [29].

Fig. 1.

Architecture of explainable product recommendation system.

4. System overview

We have proposed a system that extracts the features of interest (FoI) from text reviews using sentiment analysis and formulates the multi-criteria ratings based on the extracted features. These multi-criteria ratings and metadata are used to provide personalized recommendations. Figure 1 shows the architecture of the system. The proposed system consists of the following modules: irony and sarcasm detection and removal, feature extraction using sentiment analysis, generation of the item-feature and user-feature correlation matrices, and the recommendation model. The user-item ratings and text reviews obtained from the users, together with the metadata features, are the input to the system. The text reviews are supplied to the irony and sarcasm detection and removal module to remove ironic and sarcastic reviews. The cleaned data is given to the sentiment analysis module, which extracts the features as a list of (Feature, Opinion, Sentiment) triplets, from which the item-feature and user-feature correlation matrices (IFCM, UFCM) are formed. The newly constructed IFCM and UFCM matrices, along with the user ratings and the metadata content features, are used to infer the multi-criteria ratings and give personalized recommendations using the proposed MCCBHF algorithms.

4.1. Irony and sarcasm detection and removal

We participated in SemEval-2018 Task 3 (irony detection) and developed a system named SSN MLRG1 using a machine learning approach with Twitter data [22]. A rule-based approach was used for feature selection, and the Multilayer Perceptron (MLP) technique was used to build the model for the ironic classification subtask. As an extension of that system, the SarcasmCorpus dataset of product reviews was used in addition to the Twitter irony dataset for irony and sarcasm detection. Data was cleaned and processed using the NLTK toolkit functions. The keywords used for irony and sarcasm detection were identified using a rule-based feature selection technique. The selected features were formed into a Bag of Words (BoW) dictionary. For each sentence, feature vectors were generated by a one-hot encoding method using the sentence keywords and the BoW dictionary. The feature vectors were given to the machine learning and deep learning models to classify the review text as ironic or regular. The features in the BoW were manually evaluated for correctness; since ground truth information about the features was not available, we could not use metrics to evaluate their correctness. Some example words in the BoW are humorous, arrogant, mocking, ridiculous, beautiiiifulll, fantasssstic, etc.

Multilayer Perceptron (MLP) and deep learning models such as Long Short Term Memory networks (LSTM) and Bidirectional Encoder Representations from Transformers (BERT) were used for this classification. Algorithm 1 shows the steps for irony and sarcasm detection and removal. The system comprises the following modules: data extraction, preprocessing, rule-based feature selection, feature vector generation and a machine learning model for classification. The model was built with the sarcasm and Twitter irony datasets and was used to remove the ironic and sarcastic reviews in the Amazon product dataset. The review text and the summary of the review were used to identify ironic or sarcastic content. If the predicted label of a product review is ironic or sarcastic, that review is removed from the dataset.
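The feature vector generation and removal steps described above can be sketched as follows. This is an illustrative sketch only: the keyword list reuses the example words quoted in the text, the record fields `reviewText` and `summary` are assumed names, and `keyword_rule` is a trivial stand-in for the trained MLP, LSTM or BERT classifier, which is not reproduced here.

```python
import re

# Hypothetical BoW dictionary built from the example words given in the text.
BOW = ["humorous", "arrogant", "mocking", "ridiculous",
       "beautiiiifulll", "fantasssstic"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def one_hot(text, bow=BOW):
    """One entry per BoW keyword: 1 if the keyword occurs in the text, else 0."""
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in bow]

def remove_ironic(reviews, classify):
    """Keep only the reviews that the classifier does not flag as ironic."""
    kept = []
    for review in reviews:
        vector = one_hot(review["reviewText"] + " " + review["summary"])
        if not classify(vector):
            kept.append(review)
    return kept

def keyword_rule(vector):
    """Stand-in classifier (assumption): flag a review if any keyword fires."""
    return sum(vector) > 0

reviews = [
    {"reviewText": "Battery lasts two days.", "summary": "Good phone"},
    {"reviewText": "Fantasssstic, it died in an hour.", "summary": "Ridiculous"},
]
kept = remove_ironic(reviews, keyword_rule)
print(kept)
```

Passing the classifier in as a callable keeps the filtering step independent of the model, so a trained model's prediction function could be substituted for `keyword_rule` without changing the rest of the sketch.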

Algorithm 1:

Irony and sarcasm detection and removal

Table 2

Parts of speech categories

Abbreviation	Parts of Speech
VB	Verb, base form
VBZ	Verb, 3rd person sing. present
VBP	Verb, non 3rd person sing. present
VBD	Verb, past tense
VBG	Verb, gerund/present participle
VBN	Verb, past participle
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
NN	Noun, singular
NNP	Proper noun, singular
NNS	Noun plural
NNPS	Proper noun, plural

4.2. Feature extraction using sentiment analysis

After removing the sarcastic text from the dataset, the interesting features of users and items are extracted using sentiment analysis [23]. Feature extraction is described in the form of $(F, O, S)$ lexicon list by applying a series of steps, where F is a feature of the product, O the opinion word, and S the polarity of sentiment. The text reviews are given as input to analyze the sentiment. The stream of text is tokenized into phrases, symbols, words, or other meaningful elements. The lists of tokens are used as input for further processing, such as text mining or parsing. Stop-words are filtered out from the review text before further processing the natural language data. The stop-word list is determined by sorting the terms based on collection frequency for identifying the number of times each word occurs in the document collection. It helps in reducing the time for indexing stop-words and the number of postings.

Stanford POS tagger is used to parse each phrase which yields the POS tag for each word. Words tagged as a noun are used to represent the product features, whereas words tagged as adjective, verb, or adverb are used for identifying the opinion words from the context. After applying POS-tagging, features and their opinions are extracted using the pattern knowledge. Sentiment is identified using the Vader lexicon-based approach [13]. The resulting patterns are represented by the sentiment lexicon $(F, O, S)$ list. For example, from the text review “sound is good”, the entry (sound, good, positive) in the form of $(F, O, S)$ -tuple is extracted by feature extraction module, where F defines the feature of the product, O the opinion word expressing the feeling of the user about the feature or product and S the polarity of sentiment that can be positive, negative, or neutral.
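The triplet extraction can be illustrated with a small sketch. It assumes the input has already been tokenized, stripped of stop-words and POS-tagged (the tags are supplied by hand here rather than produced by the Stanford tagger), it handles only two-word adjacent patterns, and the tiny `POLARITY` dictionary is a hypothetical stand-in for the Vader lexicon.

```python
# Tag prefixes follow the Penn Treebank categories listed in Table 2.
OPINION_TAGS = ("JJ", "VB", "RB")  # adjective, verb and adverb families
NOUN_TAGS = ("NN",)                # noun family (NN, NNS, NNP, NNPS)

# Hypothetical polarity lexicon used only for this illustration.
POLARITY = {"good": 1, "wonderful": 1, "awesome": 1,
            "weak": -1, "short": -1, "bad": -1}

def is_noun(tag):
    return tag.startswith(NOUN_TAGS)

def is_opinion(tag):
    return tag.startswith(OPINION_TAGS)

def sentiment(word):
    """Map an opinion word to positive, negative or neutral."""
    score = POLARITY.get(word, 0)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def extract_triplets(tagged):
    """Return (feature, opinion, sentiment) triplets from adjacent word pairs."""
    triplets = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if is_opinion(t1) and is_noun(t2):    # e.g. "wonderful phone"
            triplets.append((w2, w1, sentiment(w1)))
        elif is_noun(t1) and is_opinion(t2):  # e.g. "sound good"
            triplets.append((w1, w2, sentiment(w2)))
    return triplets

# "sound is good" after stop-word removal and tagging (tags assumed).
print(extract_triplets([("sound", "NN"), ("good", "JJ")]))
```

Longer patterns such as Patterns 3 to 5 of Table 3 would need a wider matching window and are left out of this sketch.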

Table 3 shows the various pattern relations and examples for positive and negative reviews. Algorithm 2 shows the steps for extracting features from review text in the form of (feature, opinion, sentiment) triplets, creation of User-Feature Correlation Matrix (UFCM) and Item-Feature Correlation Matrix (IFCM), generating rating for multiple criteria from single user rating using Multi-Criteria Decision Making (MCDM) and generating recommendations using MCCBHF algorithms.

Table 3
Lexicon construction using POS-tagging

Pattern	Pattern Relation	Lexicon Construction (Positive/Negative)
Pattern 1	(Adj(JJ), Noun(NN))	Wonderful phone, Short batterylife
Pattern 2	(Noun(NN), Verb(VBD))	Sound good, Memory weak
Pattern 3	(Noun(NN), Verb(VBD), Noun(NN))	Camera look awesome, Picture worst quality
Pattern 4	(Adj(JJ), Noun(NN), Noun(NN))	High quality pictures, Expensive ink costs
Pattern 5	(Advb(RB), Advb(RB), Adj(JJ))	Not so difficult, Actually very difficult
Pattern 6	(Noun(NN), Noun(NN))	Compact camera, Battery consumption

Algorithm 2:

Feature extraction and multi-criteria recommendation

4.3. User–feature and item–feature correlations

In feature analysis, the user–feature correlation matrix ( $UFCM$ ) and the item–feature correlation matrix ( $IFCM$ ) are formed. Different users care about different features based on their priorities, and they tend to comment more frequently on the features they particularly care about. Similarly, different products have several kinds of features. From the sentiment lexicon, the features are extracted and paired with users and items. For example, if the user has given the review on a particular phone as “The sound is good in product P, but the picture quality is very bad”, then the user is interested in the sound and picture quality of product P. This is recorded in the user-feature correlation matrix. Product P has a good opinion on sound and a bad opinion on picture quality. This is recorded in the item-feature correlation matrix. Likewise, all the user reviews are analyzed and the feature matrices are consolidated.

User–feature correlation matrix entries are binary, 0 or 1, where 1 denotes that the user is interested in the feature and 0 denotes that the user is not interested in it. The item–feature correlation matrix measures the quality of an item for the corresponding product feature. Its entries are ternary, $- 1$ , 0 or 1, where 1 denotes a positive sentiment about that feature, 0 denotes that the feature is not present in the product and $- 1$ denotes a negative sentiment about that feature. Entries in the user–item rating matrix are ordinal in the range from 0 to 5. Values 1–5 denote the weakest, weak, medium, strong and strongest liking by the user, respectively. Unrated items are denoted by 0. More information can be extracted if the degree of interest is stored in the correlation matrices. For example, the user–feature correlation matrix can then contain the number of times the user talked about that feature in his reviews, and the item–feature correlation matrix can hold the number of users with a positive response and the number of users with a negative response.
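A minimal sketch of how the two matrices could be assembled from extracted triplets is given below. The sample records are hypothetical, and taking the sign of the net positive-minus-negative count as the ternary IFCM entry is an assumption, since the text does not specify how conflicting opinions on the same feature are combined.

```python
from collections import defaultdict

def build_matrices(records, features):
    """Build a binary UFCM and a ternary IFCM from (user, item, feature, sentiment)."""
    mentions = defaultdict(lambda: defaultdict(int))  # user -> feature -> count
    net = defaultdict(lambda: defaultdict(int))       # item -> feature -> pos - neg
    for user, item, feature, sentiment in records:
        mentions[user][feature] += 1
        net[item][feature] += {"positive": 1, "negative": -1}.get(sentiment, 0)
    # UFCM: 1 if the user mentioned the feature at least once, else 0.
    ufcm = {u: [1 if mentions[u][f] > 0 else 0 for f in features]
            for u in mentions}
    # IFCM: sign of the net sentiment count (assumed aggregation rule).
    ifcm = {i: [(net[i][f] > 0) - (net[i][f] < 0) for f in features]
            for i in net}
    return ufcm, ifcm

features = ["sound", "picture", "battery"]
records = [  # hypothetical sample data
    ("u1", "P", "sound", "positive"),
    ("u1", "P", "picture", "negative"),
    ("u2", "P", "sound", "positive"),
    ("u2", "Q", "battery", "positive"),
]
ufcm, ifcm = build_matrices(records, features)
print(ufcm)
print(ifcm)
```

The intermediate `mentions` and `net` dictionaries already hold the counts described at the end of the paragraph above, so the count-based variant only requires returning them instead of thresholding. One limitation of the sign rule is that a feature with equally many positive and negative mentions also maps to 0.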

4.4. Multi-criteria based content boosted hybrid filtering (MCCBHF)

Various collaborative filtering techniques such as the Tensor Factorization Model (TFM) [15], Matrix Factorization (MF) [26], Factorization Machine (FM) [24] and K-Nearest Neighbour (KNN) are boosted with the content information extracted from the text reviews and the multi-criteria ratings imputed from them. User preferences inferred from the text reviews are used to improve the performance of the explainable recommendation system. The multi-criteria based K nearest neighbour (MCKNN) method formulates the neighbourhood with the help of the inferred multiple ratings. Traditional KNN finds similar users based on the overall ratings, whereas MCKNN finds similar users based on the multi-criteria ratings. Two users who both gave an item a rating of 5 may differ in their interests (FoI). Thus MCKNN captures the real interests of the users and identifies like-minded neighbours, providing better recommendations where simple KNN fails.

Matrix factorization represents the latent features of interest as two matrices $P_{m \times f}$ and $Q_{f \times n}$ , where m denotes the number of users, n the number of items and f the number of latent features. Multi-criteria based explicit matrix factorization (MCEMF) initializes ${UFCM}_{m \times f}$ and ${IFCM}_{f \times n}$ as the explicit latent matrices for rating prediction. The multi-criteria based tensor factorization model (MCTFM) uses the inferred multiple ratings for the various criteria as the data tensor. Users, items and features are the three dimensions of the tensor. Multi-criteria-based factorization machines (MCFM) consider the item interaction history, multi-criteria ratings and metadata features to identify the relationship between the features and ratings. Thus the memory-based and model-based CF methods are enhanced with the content information and preferences inferred from the text reviews. The content boosted CF methods provide better results than the traditional methods.
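For readers unfamiliar with factorization machines, the sketch below evaluates the standard second-order FM score, i.e. a global bias, a linear term and pairwise interactions weighted by dot products of per-feature factor vectors. How MCFM lays out its input vector is not specified in this section, so the three-entry layout and all parameter values here are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector x.

    w0 is the global bias, w the per-feature weights and V[i] the latent
    factor vector of feature i.
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

# Hypothetical layout: [user indicator, item indicator, one criterion rating].
x = [1, 1, 0]
w0 = 1.0
w = [0.5, 0.5, 0.5]
V = [[1, 0], [1, 1], [0, 1]]
score = fm_predict(x, w0, w, V)  # 1.0 + 1.0 + 1.0 = 3.0
print(score)
```

The double loop is written for clarity; practical FM implementations use an algebraically equivalent form that is linear in the number of features.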

4.5. Multi-criteria based explicit matrix factorization (MCEMF)

In a recommendation system such as Netflix or Amazon, there is a group of users and a set of items. Given that each user has rated some items in the system, the aim of the system is to predict how the users would rate the items that they have not yet rated, such that recommendations can be generated for the users. In this case, all the information about the existing ratings can be represented in the form of the matrix. Ratings are integers ranging from 1 to 5. The problem of predicting the rating for unrated items can be solved by identifying the latent features hidden in the ratings. These hidden features are used to determine how the user rates an item.

Fig. 2.

Matrix factorization.

Matrix Factorization (MF) is the process of factorizing a given matrix, i.e., finding two matrices whose product approximates the original matrix. It is a technique that can be used to discover the latent features underlying the interactions between two different kinds of entities [26]. The intuition behind using matrix factorization is to solve the problem of recommending items to users who have not rated those items, based on latent features that determine how a user rates an item. Figure 2 shows the working of matrix factorization. $R_{m \times n}$ is the rating matrix for m users and n items, which is split into the latent factor matrices $P_{m \times f}$ and $Q_{f \times n}$ , where f denotes the number of latent features. Instead of starting with random latent features, the features of the correlation matrices can be used explicitly as the latent features to predict the rating for a given user and item.

Equation (1) shows the calculation of the rating prediction ${\hat{r}}_{ui}$ and Eq. (2) shows the objective function (the error) to be minimized. Equation (3) shows the rating prediction ${\hat{r}}_{ui}$ with the mean rating μ, user bias $b_{u}$ and item bias $b_{i}$ added to it, and Eq. (4) shows the corresponding objective function (the error with bias) to be minimized. Here, k is the set of $(u, i)$ pairs for which $r_{ui}$ is known in the training set, and the constant λ is used for regularization.

$$\hat{r}_{ui} = Q_{i}^{T} P_{u} \tag{1}$$

$$\min_{Q, P} \sum_{(u, i) \in k} {(r_{ui} - Q_{i}^{T} P_{u})}^{2} + \lambda (\| Q_{i} \|^{2} + \| P_{u} \|^{2}) \tag{2}$$

$$\hat{r}_{ui} = \mu + b_{i} + b_{u} + Q_{i}^{T} P_{u} \tag{3}$$

$$\min_{Q, P} \sum_{(u, i) \in k} {(r_{ui} - \mu - b_{u} - b_{i} - Q_{i}^{T} P_{u})}^{2} + \lambda (\| Q_{i} \|^{2} + \| P_{u} \|^{2} + b_{u}^{2} + b_{i}^{2}) \tag{4}$$
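One common way to minimize an objective of the form of Eq. (4) is stochastic gradient descent on one observed rating at a time. The sketch below is written under that assumption; the optimizer, the learning rate `lr`, the regularization value `lam` and the sample numbers are not taken from the text. The user and item vectors are initialized from a hypothetical UFCM row and IFCM column, in the spirit of MCEMF.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Eq. (3): mean rating plus biases plus the dot product of the factors."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.01, lam=0.02):
    """One gradient step on the regularized squared error of a single rating.

    The constant factor 2 of the gradient is absorbed into the learning rate.
    """
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - lam * b_u)
    new_b_i = b_i + lr * (err - lam * b_i)
    new_p = [p + lr * (err * q - lam * p) for p, q in zip(p_u, q_i)]
    new_q = [q + lr * (err * p - lam * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p, new_q

# Hypothetical values: mean rating 3.5, observed rating 5.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u = [1.0, 0.0, 1.0]    # UFCM row of the user (features of interest)
q_i = [1.0, -1.0, 0.0]   # IFCM column of the item (feature sentiments)
r_ui = 5.0

err_before = r_ui - predict(mu, b_u, b_i, p_u, q_i)
for _ in range(200):
    b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
err_after = r_ui - predict(mu, b_u, b_i, p_u, q_i)
print(err_before, err_after)
```

Setting `b_u`, `b_i` and `mu` to zero and skipping their updates reduces the same step to the unbiased model of Eqs. (1) and (2).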

The process extracts product features and user opinions explicitly from reviews and incorporates user-feature and item-feature relations, as well as user-item ratings, into a unified explicit matrix factorization (EMF) framework. In EMF the latent features are given explicitly: UFCM and IFCM replace the latent matrices P and Q. The model mainly focuses on user attention and item qualities in the feature space. The framework combines implicit and explicit factors to achieve both high accuracy and explainability. The explicit feature-level explanations support both recommending and dis-recommending items.

Fig. 3.

Tensor factorization.

4.6. Multi-criteria based tensor factorization model (MCTFM)

User–item ratings form two-dimensional data, for which latent factor models such as matrix factorization work well. When the features of interest extracted from the review texts using the sentiment lexicon are added as a new dimension, the data becomes three-dimensional, and a tensor representation is therefore more suitable. A tensor of order N is a multi-dimensional array, the N-way generalization of vectors and matrices [15]. Figure 3 is a pictorial representation of TF, where T is the tensor containing the observations (inferred multi-criteria ratings), S is the core tensor, U the users latent matrix, I the items latent matrix and F the product features latent matrix. T is a rating tensor $R_{m \times n \times f}$ where m is the number of users, n the number of items and f the number of product features of interest. The 3-dimensional tensor factorization is denoted by Eq. (5). Equation (6) shows the formula for rating prediction, where $d_{1}$, $d_{2}$ and $d_{3}$ are the dimensions of the core tensor, and Eq. (7) shows the objective function (error) to be minimized over the set D of observed entries. $\begin{array}{ll} (5) & T = S \times U \times I \times F \\ (6) & {\hat{r}}_{mnf} = \sum_{i = 1}^{d_{1}} \sum_{j = 1}^{d_{2}} \sum_{k = 1}^{d_{3}} S_{ijk} \cdot U_{mi} \cdot F_{fj} \cdot I_{nk} \\ (7) & \text{objective function} = \min \sum_{(m, n, f) \in D} {(r_{mnf} - {\hat{r}}_{mnf})}^{2} \end{array}$
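The prediction rule of Eq. (6) and the error of Eq. (7) can be sketched directly as nested sums. The core tensor and factor matrices below are made-up toy values, and the function names are illustrative only; the index pairing (first core index with U, second with F, third with I) follows Eq. (6).

```python
def tucker_predict(S, U, I, F, m, n, f):
    """Eq. (6): r_hat[m, n, f] = sum_{i,j,k} S[i][j][k] * U[m][i] * F[f][j] * I[n][k]."""
    total = 0.0
    for i, S_i in enumerate(S):
        for j, S_ij in enumerate(S_i):
            for k, s in enumerate(S_ij):
                total += s * U[m][i] * F[f][j] * I[n][k]
    return total


def squared_error(observed, S, U, I, F):
    """Eq. (7) without the min: sum of squared errors over the observed entries."""
    return sum((r - tucker_predict(S, U, I, F, m, n, f)) ** 2
               for (m, n, f), r in observed.items())


# Toy factors (hypothetical): core of size d1 x d2 x d3 = 2 x 1 x 2.
S = [[[1.0, 0.0]], [[0.0, 2.0]]]            # S[i][j][k]
U = [[1.0, 0.0], [0.0, 1.0]]                # 2 users  x d1
F = [[1.0], [0.5]]                          # 2 features x d2
I = [[1.0, 0.5], [2.0, 1.0], [0.0, 3.0]]    # 3 items  x d3

r_hat_a = tucker_predict(S, U, I, F, m=0, n=1, f=0)   # 1*1*1*2 = 2.0
r_hat_b = tucker_predict(S, U, I, F, m=1, n=2, f=1)   # 2*1*0.5*3 = 3.0

observed = {(0, 1, 0): 2.5, (1, 2, 1): 3.0}           # (user, item, feature) -> rating
objective = squared_error(observed, S, U, I, F)       # 0.5**2 + 0.0**2 = 0.25
```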

Decomposition and factorization techniques in use include Tucker decomposition [15], Non-Negative tensor factorization (NTF) [15], Exponential family tensor factorization (ETF) [11], Full rank tensor completion (FTC) [21], and so forth. Tensor factorization helps to address the sparsity problem and produces more meaningful physical interpretations. In the proposed factorization model, the 3D tensor is decomposed by the Tucker decomposition method with the core size fixed in advance. Tucker decomposition is a form of higher-order principal component analysis, closely related to Multi-linear Principal Component Analysis and Higher-Order Singular Value Decomposition (HOSVD) [15], and can be viewed as a generalization of Singular Value Decomposition (SVD). It decomposes the high-order tensor into one core tensor and a set of factor matrices, one for each dimension. Tucker decomposition has been categorized into various types, namely Tucker1, Tucker2 and Tucker3. The rating data along with the extracted features is factorized by Higher-Order Orthogonal Iteration (HOOI) [15] under Tucker3 decomposition. It runs for a number of iterations to predict the values that help us to generate valuable recommendations with explanations for the product.
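HOSVD and HOOI operate on mode-n unfoldings (matricizations) of the tensor: each factor matrix is obtained from the leading left singular vectors of one unfolding [15]. The sketch below shows only the unfolding step for a 3-way tensor stored as nested lists, using the column ordering of Kolda and Bader [15]; the SVD and the HOOI iterations themselves are omitted, and the function name `unfold` is an illustrative choice.

```python
def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor T[a][b][c] given as nested lists.

    Rows are indexed by the chosen mode; in the columns, the earlier of the two
    remaining modes varies fastest (Kolda and Bader convention).
    """
    dims = (len(T), len(T[0]), len(T[0][0]))
    first, second = [d for d in range(3) if d != mode]
    n_cols = dims[first] * dims[second]
    M = [[0.0] * n_cols for _ in range(dims[mode])]
    for a in range(dims[0]):
        for b in range(dims[1]):
            for c in range(dims[2]):
                idx = (a, b, c)
                col = idx[first] + idx[second] * dims[first]
                M[idx[mode]][col] = T[a][b][c]
    return M


# 3 x 4 x 2 example tensor with entries 1..24 (the standard textbook example):
# T[a][b][c] = 1 + a + 3*b + 12*c
T = [[[1 + a + 3 * b + 12 * c for c in range(2)] for b in range(4)]
     for a in range(3)]

X1 = unfold(T, 0)   # 3 x 8 matrix (users-by-(items x features) in the MCTFM setting)
X2 = unfold(T, 1)   # 4 x 6 matrix
X3 = unfold(T, 2)   # 2 x 12 matrix
```

In the MCTFM setting the three unfoldings would correspond to the user, item and feature modes of the rating tensor.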

4.7. Multi-criteria based factorization machines (MCFM)

Factorization Machine (FM) is a generic approach that can mimic most factorization models through feature engineering. In this way, FMs combine the generality of feature engineering with the superiority of factorization models in estimating interactions between the input variables [24]. FMs are not limited to specific types of data; they are general predictors. All nested variable interactions are modeled by the factorization machine, which is comparable to a polynomial kernel in Support Vector Machines (SVM) but uses a factorized parameterization instead of a dense parameterization as in SVM. Factorization machines can be applied to a variety of prediction tasks such as regression, classification and ranking. In regression, the utility function $y (x)$ can be used directly as the predictor and the optimization criterion is to minimize the least square error on the dataset. In binary classification, the sign of $y (x)$ is used and the parameters are optimized to minimize hinge loss or logit loss. In ranking, the vectors x are ordered by the score of $y (x)$ and optimization is done over pairs of instance vectors.

The model Eq. (8) for a factorization machine with degree $d = 2$ is defined as: $\begin{matrix} (8) & y (x) = w_{0} + \sum_{j = 1}^{p} w_{j} x_{j} + \sum_{j = 1}^{p} \sum_{j^{'} = j + 1}^{p} x_{j} x_{j^{'}} \sum_{f = 1}^{k} v_{j, f} v_{j^{'}, f} \end{matrix}$ where k is the dimensionality of the factorization and the model parameters $w_{0}$ to $w_{p}$ and $v_{1, 1}$ to $v_{p, k}$ are real values. The first part of the FM model contains the unary interactions of each input variable $x_{j}$ with the target, exactly as in a linear regression model. This unary interaction is denoted by the weight $w_{j}$ for every variable $x_{j}$. The second part, with the two nested sums, contains all pairwise interactions $⟨ v_{j}, v_{j^{'}} ⟩$ of input variables ( $x_{j}$ and $x_{j^{'}}$ ). The important difference from standard polynomial regression is that the effect of the interaction is not modeled by an independent parameter $w_{j, j^{'}}$ but by a factorized parametrization, which corresponds to the assumption that the effect of pairwise interactions has a low rank [25]. This allows FMs to estimate reliable parameters even in highly sparse data where standard models fail. In MCFM, the matrix with multiple ratings and the metadata content features are used to find the relationship between the users and items.
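To make Eq. (8) concrete, the sketch below evaluates the degree-2 FM model in two ways: literally, with the double sum over pairs, and with the well-known reformulation of the pairwise term, $\frac{1}{2} \sum_{f} [ (\sum_{j} v_{j, f} x_{j})^{2} - \sum_{j} v_{j, f}^{2} x_{j}^{2} ]$, which needs only O(kp) operations. The parameter values and the feature vector are made up for illustration; how MCFM actually encodes users, items, multi-criteria ratings and metadata into x is not specified here and is left as an assumption.

```python
def fm_predict_naive(x, w0, w, V):
    """Eq. (8) evaluated literally: bias + linear term + all pairwise interactions."""
    p, k = len(x), len(V[0])
    y = w0 + sum(w[j] * x[j] for j in range(p))
    for j in range(p):
        for j2 in range(j + 1, p):
            dot = sum(V[j][f] * V[j2][f] for f in range(k))   # <v_j, v_j'>
            y += x[j] * x[j2] * dot
    return y


def fm_predict_fast(x, w0, w, V):
    """Same value as Eq. (8), with the pairwise term computed in O(k * p)."""
    p, k = len(x), len(V[0])
    y = w0 + sum(w[j] * x[j] for j in range(p))
    for f in range(k):
        s = sum(V[j][f] * x[j] for j in range(p))             # sum_j v_jf x_j
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(p))   # sum_j v_jf^2 x_j^2
        y += 0.5 * (s * s - s_sq)
    return y


# Hypothetical parameters for p = 3 input variables and k = 2 factors.
w0 = 3.0
w = [0.1, -0.2, 0.4]
V = [[1.0, 0.0], [0.5, 1.0], [2.0, -1.0]]
x = [1.0, 1.0, 0.5]

y_naive = fm_predict_naive(x, w0, w, V)   # 3.0 + 0.1 + 1.5 = 4.6
y_fast = fm_predict_fast(x, w0, w, V)
```

Because the interaction weights are inner products of k-dimensional vectors, an interaction between two variables that never co-occur in the training data can still be estimated, which is the property the paragraph above attributes to FMs on sparse data.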

5. Results and discussions

5.1. Dataset

The SemEval 2018 dataset used for irony detection and removal consists of 4792 English tweets collected from 2676 unique users. The tweets are manually labeled using a fine-grained annotation scheme for irony [28]. The training set contains 1911 ironic tweets and 1923 regular tweets, and the test set includes 958 tweets. We have augmented this dataset with the Sarcasm Product Review Dataset [9], which contains 437 ironic reviews and 817 regular reviews.

The “cellphones and accessories” dataset from Amazon products, used for explainable product recommendation, contains 194,439 review and rating records [20]. It comprises 27,879 unique users and 10,429 unique products. The fields of interest are productID, reviewerID, overall (rating), reviewtext, summary and reviewtime. Metadata features such as title, price, salesrank, brand and category are used as content features. We have carried out 10-fold cross validation for our experiments.
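For readers unfamiliar with the protocol, 10-fold cross validation can be sketched as follows. The source does not say how the folds were formed, so the shuffled random split, the seed and the function name `k_fold_splits` are assumptions of this example.

```python
import random


def k_fold_splits(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Records are shuffled once (an assumption) and dealt into k folds of nearly
    equal size; each fold serves as the test set exactly once.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Small demonstration with 25 records instead of the 194,439 in the dataset.
splits = list(k_fold_splits(25, k=10))
fold_sizes = [len(test) for _, test in splits]
```

The reported error would then be the average of the per-fold errors, each computed on the held-out fold after training on the other nine.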

The review texts range from a minimum of 18 tokens to a maximum of 706 tokens. The distribution of the data ranges from a minimum of 13 sentences with 88 tokens to a maximum of 892 sentences with 556 tokens. Tokens are counted after stop-word removal. Most of the sentences have tokens in the range of 350 to 600.

5.2. Performance comparison

Results of the various models used for irony detection are listed in Table 4. Multilayer Perceptron (MLP) and deep learning models such as Long Short Term Memory networks (LSTM) and Bidirectional Encoder Representations from Transformers (BERT) are used to detect ironic and sarcastic content in the review text. The BERT model achieves the highest accuracy (0.8672), precision and F1-score, while LSTM has the highest recall (0.808).

Table 4
Performance results for irony and sarcasm detection

Algorithm	Accuracy	Precision	Recall	F1-Score
MLP	0.5727	0.348	0.361	0.334
LSTM	0.7812	0.782	0.808	0.795
BERT	0.8672	0.865	0.775	0.817
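As a quick consistency check on Table 4, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall columns for the LSTM and BERT rows. The MLP row does not reproduce this way (the harmonic mean of 0.348 and 0.361 is about 0.354, not 0.334), which would be expected if those scores were averaged per class; the source does not say which averaging was used, so that explanation is only a guess.

```python
def f1_score(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from Table 4.
table4 = {
    "MLP": (0.348, 0.361, 0.334),
    "LSTM": (0.782, 0.808, 0.795),
    "BERT": (0.865, 0.775, 0.817),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in table4.items()}
gaps = {name: abs(recomputed[name] - reported)
        for name, (_, _, reported) in table4.items()}
```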

Table 5 shows the performance of the product recommendation system using traditional techniques with ratings alone. Content-based filtering (CBF) gives a larger error than the collaborative filtering techniques. Among the model-based CF techniques, SVD++ has the lowest error (RMSE 1.15949), marginally ahead of SVD (RMSE 1.160988).

Table 5

Comparison of algorithms with ratings alone

Algorithms	MAE	RMSE	MSE
SVD	0.9000	1.160988	1.3479
SVD++	0.891952	1.15949	1.3444
NMF	1.073729	1.382393	1.9110
CoClustering	0.927179	1.29830	1.6856
KNN	1.061831	1.361908	1.8548
CBF	1.12145	1.40943	1.9865
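The three error metrics in Table 5 are related: MSE is the mean of the squared errors, RMSE is its square root, and MAE is the mean absolute error. The sketch below defines them on a made-up list of ratings (the toy values and function names are illustrative) and checks that the RMSE and MSE columns of Table 5 agree with each other.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


# Toy example (hypothetical ratings): errors are 1, 0, -1, -2.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
toy_mae = mae(actual, predicted)     # (1 + 0 + 1 + 2) / 4 = 1.0
toy_mse = mse(actual, predicted)     # (1 + 0 + 1 + 4) / 4 = 1.5
toy_rmse = rmse(actual, predicted)   # sqrt(1.5)

# (RMSE, MSE) pairs copied from Table 5; RMSE squared should match MSE.
table5 = {
    "SVD": (1.160988, 1.3479), "SVD++": (1.15949, 1.3444),
    "NMF": (1.382393, 1.9110), "CoClustering": (1.29830, 1.6856),
    "KNN": (1.361908, 1.8548), "CBF": (1.40943, 1.9865),
}
max_gap = max(abs(r ** 2 - m) for r, m in table5.values())
```

Because squaring penalizes large errors more heavily, RMSE is never smaller than MAE, which is consistent with every row of Table 5.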

Table 6 shows the comparison of error metrics of traditional algorithms using ratings alone and review text alone. Traditional algorithms such as SVD, SVD++, NMF and KNN underperform when the review features are added as regularisers along with the ratings: both MSE and RMSE increase for these algorithms.

Table 6

Comparison of traditional algorithms with text alone and rating alone

Algorithms	RMSE with Ratings	RMSE with Reviews	MSE with Ratings	MSE with Reviews
SVD	1.160988	1.1671	1.3479	1.3621
KNN	1.361908	1.3723	1.8548	1.8832
NMF	1.382393	1.4190	1.9110	2.0136
SVD++	1.159490	1.1731	1.3444	1.3762

Figure 4 shows the error metrics for the traditional algorithms with ratings and with the text reviews. From the figure, it is clear that adding the text features increases the error rather than reducing it. This is because all the text features are taken into consideration without any filtering.

Fig. 4.

Error metrics comparison for traditional algorithms.

Table 7

Metrics of reviewtext as regulariser and as features

Algorithms	RMSE with Review	MSE with Review
MF	1.1707	1.3705
NeuMF	1.1690	1.3665
SVD	1.1671	1.3621
KNN	1.3723	1.8832
NMF	1.4190	2.0136
SVD++	1.1723	1.3744
HFT	1.1657	1.3588
Deepconn	1.1946	1.427
Narre	1.2238	1.4978
ALFM	1.1671	1.3621

Table 7 shows the results of using the reviewtext as a regulariser and as features. The error increases for SVD, SVD++ and NMF when the text reviews are included. Deep learning algorithms such as Deepconn [35] and Narre [5] have higher error rates than the other algorithms, as they could not infer the essential features from the reviewtext and take all the features for processing. The HFT algorithm [19] performs best, with the lowest error among the compared algorithms. The Aspect-aware Latent Factor Model (ALFM) [6] and SVD give the next-best results.

Table 8 presents the performance results for the multi-criteria based content boosted hybrid filtering (MCCBHF) techniques, namely Multi-Criteria based Tensor Factorization Model (MCTFM), Multi-Criteria based K Nearest Neighbors (MCKNN), Multi-Criteria based Factorization Machine (MCFM) and Multi-Criteria based Explicit Matrix Factorization (MCEMF), alongside other algorithms. The comparison is made for the original dataset with all the reviews and for the cleaned dataset obtained after removing the ironic and sarcastic reviews. The state-of-the-art techniques from the literature (HFT, Deepconn, Narre, Transnet and ALFM) are evaluated on the whole dataset only. In all the memory-based and model-based collaborative filtering techniques, the content features from metadata and the features of interest extracted from text reviews are added to increase the performance of the personalized recommendation. Every proposed MCCBHF method has a lower RMSE than its base technique (TFM, KNN, FM and MF, respectively) on both datasets, and MCTFM, MCFM and MCEMF also have a lower RMSE than the literature baselines on the original dataset. The error of the MCCBHF methods is reduced further when the ironic reviews are removed. Since the MCCBHF methods already achieve the lower error on the original dataset, the improvement is attributable mainly to multi-criteria based content boosted hybrid filtering rather than to the removal of ironic content.

Table 8

Comparison of algorithms with ratings and text reviews

Algorithms	RMSE for cleaned dataset	RMSE for original dataset
LSTM	1.4780	1.6320
HFT	–	1.1657
Deepconn	–	1.1946
Narre	–	1.2238
Transnet	–	1.2070
ALFM	–	1.1671
TFM	1.2311	1.2972
MCTFM	1.0743	1.1224
KNN	1.3723	1.3663
MCKNN	1.2747	1.2998
FM	1.1698	1.2245
MCFM	1.0256	1.0625
MF	1.1707	1.2311
MCEMF	1.1278	1.1413
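The gain of each MCCBHF method over its base technique can be read off Table 8 as a relative RMSE reduction, (base − boosted) / base. The sketch below computes it for both dataset variants using only the values in the table; the variable and function names are illustrative.

```python
# RMSE values copied from Table 8: name -> (cleaned dataset, original dataset).
table8 = {
    "TFM": (1.2311, 1.2972), "MCTFM": (1.0743, 1.1224),
    "KNN": (1.3723, 1.3663), "MCKNN": (1.2747, 1.2998),
    "FM": (1.1698, 1.2245), "MCFM": (1.0256, 1.0625),
    "MF": (1.1707, 1.2311), "MCEMF": (1.1278, 1.1413),
}
# Base technique paired with its multi-criteria content boosted variant.
pairs = [("TFM", "MCTFM"), ("KNN", "MCKNN"), ("FM", "MCFM"), ("MF", "MCEMF")]


def relative_reduction(base, boosted):
    """Percentage by which the boosted RMSE is lower than the base RMSE."""
    return 100.0 * (base - boosted) / base


reductions = {
    boosted: (relative_reduction(table8[base][0], table8[boosted][0]),   # cleaned
              relative_reduction(table8[base][1], table8[boosted][1]))   # original
    for base, boosted in pairs
}
best_method = min((b for _, b in pairs), key=lambda name: table8[name][0])

for name, (cleaned, original) in reductions.items():
    print(f"{name}: {cleaned:.2f}% (cleaned), {original:.2f}% (original)")
```

All four reductions are positive on both dataset variants, and MCFM has the lowest RMSE of the four proposed methods.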

Fig. 5.

Accuracy for MCCBHF algorithms.

Figure 5 shows the accuracy of the different models developed. MCFM provides better accuracy and a lower error rate than the other techniques, as a factorization machine can handle the different dimensions of the data and the interactions among them very well.

6. Conclusion

User satisfaction can be increased by providing recommendations based on the preferences of the user and by giving an explanation for recommending the product. User preferences can be identified from their implicit and explicit actions. An easy method to determine the interests of the user is to analyze the user’s reviews. Explanations can also be generated by analyzing the product reviews. For a particular user, the features of interest are extracted from his reviews on various products, and item features are extracted from all the reviews of that particular product. The reviews of an item may contain both positive and negative comments for each feature.

The main challenge lies in verifying the trustworthiness of the text reviews. As a first step, sarcastic reviews are identified and removed from the dataset. Multi-criteria ratings give better insight into the reason behind the user's rating, which can also be used to explain the recommendation. Various collaborative filtering techniques such as KNN, TFM, FM and MF are boosted with the content features, and the performance of MCKNN, MCTFM, MCFM and MCEMF is analyzed. With the addition of content features to the basic CF techniques, each MCCBHF algorithm performed better than its base technique. Among the MCCBHF algorithms, MCFM performed the best. This explainable product recommendation system increases the transparency and satisfaction of users in buying the recommended product.

To conclude, we have inferred the user preferences in the form of multi-criteria ratings with the help of the single rating, text reviews and the metadata features of the products. The proposed hybrid filtering methods performed well when compared to the existing techniques as discussed in the results section. There is still scope for reducing the computation time and for extracting features of interest from other explicit and implicit feedback. Further processing of the review text can be done to identify informative reviews and to detect fake reviews for better performance.

Availability of data and material

The datasets used in the experiments are publicly available in the online repository.

Competing interests

The authors declare that they have no conflict of interest.

References

1. Aciar, Zhang, Simoff and Debenham, Informed recommender: Basing recommendations on consumer product reviews, IEEE Intelligent Systems 22(3) (2007).

2. Bao, Fang and Zhang, Topicmf: Simultaneously exploiting ratings and reviews for recommendation, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 2–8.

3. Barbieri and Saggion, Automatic detection of irony and humour in Twitter, in: ICCC, Fifth International Conference on Computational Creativity, Ljubljana, Slovenia, 9th–13th, June 2014, 2014, pp. 155–162.

4. Bauman and Tuzhilin, Discovering contextual information from user reviews for recommendation purposes, CBRecSys 2014 (2014), 2–9.

5. Chen, Zhang, Liu and Ma, Neural attentional rating regression with review-level explanations, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1583–1592.

6. Cheng, Ding, Zhu and Kankanhalli, Aspect-aware latent factor model: Rating prediction with ratings and reviews, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 639–648.

7. M.F. Dacrema, Cremonesi and Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 101–109. doi:10.1145/3298689.3347058.

8. R.M. D’Addio, M.A. Domingues and M.G. Manzato, Exploiting feature extraction techniques on users reviews for movies recommendation, Journal of the Brazilian Computer Society 23(1) (2017), 1–16. doi:10.1186/s13173-016-0050-7.

9. Filatova, Irony and sarcasm: Corpus generation and analysis using crowdsourcing, in: LREC, Citeseer, 2012, pp. 392–398.

10. Ghosh and Veale, Fracking sarcasm using neural network, in: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2016, pp. 161–169. doi:10.18653/v1/W16-0425.

11. Hayashi, Takenouchi, Shibata, Kamiya, Kato, Kunieda, Yamada and Ikeda, Exponential family tensor factorization: An online extension and applications, Knowledge and Information Systems 33(1) (2012), 57–88. doi:10.1007/s10115-012-0517-6.

12. Hernández-Farías, J.-M. Benedí and Rosso, Applying basic features from sentiment analysis for automatic irony detection, in: Iberian Conference on Pattern Recognition and Image Analysis, Springer, 2015, pp. 337–344.

13. Hutto and Gilbert, VADER: A parsimonious rule-based model for sentiment analysis of social media text, in: Proceedings of the International AAAI Conference on Web and Social Media 8(1) (2014), 216–225. doi:10.1609/icwsm.v8i1.14550.

14. Kaushik and Mishra, A Scalable, Lexicon Based Technique for Sentiment Analysis (2014), 01–05, arXiv preprint arXiv:1410.2265.

15. T.G. Kolda and B.W. Bader, Tensor decompositions and applications, SIAM Review 51(3) (2009), 455–500. doi:10.1137/07070111X.

16. C.W. Leung, S.C. Chan and F.-L. Chung, Integrating collaborative filtering and sentiment analysis: A rating inference approach, in: Proceedings of the ECAI 2006 Workshop on Recommender Systems, 2006, pp. 62–66.

17. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies 5(1) (2012), 1–167. doi:10.1007/978-3-031-02145-9.

18. Maynard and M.A. Greenwood, Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis, in: LREC, 2014, pp. 4238–4243.

19. McAuley and Leskovec, Hidden factors and hidden topics: Understanding rating dimensions with review text, in: Proceedings of the 7th ACM Conference on Recommender Systems, 2013, pp. 165–172. doi:10.1145/2507157.2507163.

20. McAuley, Pandey and Leskovec, Inferring networks of substitutable and complementary products, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 785–794. doi:10.1145/2783258.2783381.

21. Qi, Zhang and Chen, A tensor rank theory and maximum full rank subtensors, 2020, arXiv preprint arXiv:2004.11240.

22. Rajalakshmi, S.M. Rajendram, Mirnalinee et al., SSN MLRG1 at SemEval-2018 task 3: Irony detection in English tweets using multilayer perceptron, in: Proceedings of the 12th International Workshop on Semantic Evaluation, 2018, pp. 633–637.

23. Rajalakshmi, S.M. Rajendram, Mirnalinee et al., SSN MLRG1 at SemEval-2018 task 1: Emotion and sentiment intensity detection using rule based feature selection, in: Proceedings of the 12th International Workshop on Semantic Evaluation, 2018, pp. 324–328.

24. Rendle, Factorization machines with libfm, ACM Transactions on Intelligent Systems and Technology (TIST) 3(3) (2012), 1–22. doi:10.1145/2168752.2168771.

25. Rendle and Schmidt-Thieme, Pairwise interaction tensor factorization for personalized tag recommendation, in: Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 81–90. doi:10.1145/1718487.1718498.

26. Ricci, de Gemmis and Semeraro, Matrix and tensor factorization techniques applied to recommender systems: A survey, Matrix 01 (2012).

27. R.-P. Shen, H.-R. Zhang, Yu and Min, Sentiment based matrix factorization with reliability for recommendation, Expert Systems with Applications 135 (2019), 249–258. doi:10.1016/j.eswa.2019.06.001.

28. Van Hee, Lefever and Hoste, Guidelines for Annotating Irony in Social Media Text, Technical Report, version 2.0. Technical Report 16-01, LT3, Language and Translation Technology Team–Ghent University, 2016.

29. Van Hee, Lefever and Hoste, Semeval-2018 task 3: Irony detection in English tweets, in: Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, 2018.

30. B.C. Wallace, Computational irony: A survey and new perspectives, Artificial Intelligence Review 43(4) (2015), 467–483. doi:10.1007/s10462-012-9392-5.

31. Wang, Yang, Chen, Yuan, Geng and Hai, Global and local tensor factorization for multi-criteria recommender system, Patterns 1(2) (2020), 100023. doi:10.1016/j.patter.2020.100023.

32. Zhang, Yao, Sun, Wang, Long and Dong, Neurec: On nonlinear transformation for personalized ranking, 2018, arXiv preprint arXiv:1805.03002.

33. Zhang, Incorporating phrase-level sentiment analysis on textual reviews for personalized recommendation, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 435–440. doi:10.1145/2684822.2697033.

34. Zhang, Lai, Zhang, Liu and Ma, Explicit factor models for explainable recommendation based on phrase-level sentiment analysis, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2014, pp. 83–92.

35. Zheng, Noroozi and P.S. Yu, Joint deep modeling of users and items using reviews for recommendation, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 425–434. doi:10.1145/3018661.3018665.

Table 1

Multi-criteria rating matrix

Features: Display resolution, Camera quality, Battery life, Cost

	Item1	Item2	Item3	Item4	Item5
User1	5		7
User1	2,2,8,8		5,5,9,9
User2	5
User2	8,8,2,2
User3			6
User3			3,3,9,9
User4		6		9	3
User4		4,4,8,8		8,8,10,10	1,3,4,0
