Abstract
This paper presents a systematic review of research into package recommender systems (PRS). PRS recommend a combination of items rather than individual items, for example, multiple attractions to visit on a day trip to a city, trousers and a shirt to wear together, or multiple dishes to make up a meal. The review provides a framework for considering existing PRS research, and highlights the techniques used for package recommendation and the evaluation methods and metrics employed. It also raises many issues that warrant future research.
Introduction
Recommender systems, which recommend products or services for users to consume, have been around since the early 1990s. The goal of these systems is to overcome the problem of information overload. Different techniques have been used, such as collaborative filtering [1], content-based filtering [2] and hybrid filtering [3], which combines the former two techniques. Techniques have become increasingly sophisticated, and the application domains have been extended. Recommendations are no longer only for single items, but also for sets of items, for instance, to plan a tourist route [4], decide which courses to follow [5], or decide which dishes fit together as a meal.
Package recommendation (also known as bundle recommendation) is defined as the suggestion of a group of items that fit well together [6]. Despite this general definition, the application of package recommendation can take many forms. For instance, in travel recommendations, points of interest (POI) can be combined to propose an ideal route for a user [4, 7, 8]. This type of recommendation can be seen as a sequential package: the recommended items have to be consumed in a certain order, because the order affects the appropriateness of the package. Another example is the recommendation of clothes (e.g. the combination of a skirt and a shirt). In this case, the sequence of the items is of no importance, and the package is called a complementary package.
In recent years, several systematic literature reviews have been conducted in the recommender systems domain. However, to the best of our knowledge, no systematic literature review has focused specifically on package recommender systems (PRS), which are the focus of this paper. We believe this is the right time for such a survey, given the increasing interest in PRS (see Fig. 1). The goal of this survey is to establish the state of the art in this field, focusing on domains, techniques, and evaluation methods and metrics used.
Fig. 1. Number of PRS articles through the years in the survey.
In the following section, the research method for this systematic literature review will be explained. It contains the research questions, search strategy, criteria and procedures to find, analyze and synthesize the data. Section 3 discusses the results of the review. The domains, package types, recommendation input types, PRS phases and techniques, and evaluation methods and metrics are explained. The last section contains the conclusions of this research and directions for future work.
Research questions
The main goal of this research is to find the techniques used for package recommendations and how the performance of each technique is evaluated. To investigate this, we used the following research questions:
1. In what domains are package recommender systems applied?
2. What are the different techniques for package recommendation and how do they work?
3. What evaluation methods are used to evaluate the performance of the package recommendation methods?
4. What evaluation metrics are used when evaluating the performance of package recommendation methods?
To find all the relevant information about package recommendation, we defined a search strategy. First, relevant databases were selected to support a good search; these are listed in Table 1. The second step was to create a search string to search the databases.
Search sources
To create a search string, we started with basic terms such as “package recommendation” as keywords. Through informal searches, we found more terms such as “bundle recommendation” and “clustering recommendation”. We then used synonyms for these keywords to arrive at the final search string.
“package recommendation” OR “package recommender systems” OR “package recommender system” OR “bundle recommendation” OR “recommending packages” OR “recommending package” OR “clustering recommendation” OR “clustering-based recommendation”
The first author ran the search string in the different databases. A list was made of all the articles and duplicates were removed. This resulted in 141 articles, as can be seen in Table 2. The first inclusion round was then performed with the inclusion criteria from Section 2.3. The first author reviewed all the articles based on title and abstract and recorded, for each paper, which inclusion criteria it did or did not meet. 30 of the 141 articles were included. During this round, a new criterion called ‘accessible’ was added: not every article could be accessed, so inaccessible articles were excluded from the research. The 30 included articles were then used as the starting set for snowballing, a method to find new articles from the current articles by using their references (backward snowballing) or the articles that cite them (forward snowballing). In this research both methods were applied. Backward snowballing was applied for one round, so the references of the 30 included articles were added to a list. Forward snowballing was applied until no new articles were found. After duplicates were removed, 412 new articles remained from backward snowballing and 400 from forward snowballing. For the second inclusion round, the procedure of the first inclusion round was applied to the articles obtained by snowballing. This resulted in 74 included articles, which brings the total to 104 included articles for the research.
Inclusion of papers
To select the studies suitable for this research, inclusion and exclusion criteria were defined. The inclusion criteria mentioned in the search strategy are defined below. They are split into two categories. The first category consists of criteria that must all be met by an article.
IC1: Publication date of an article should be between 1999 and 2018.
IC2: Language should be English.
IC3: An article should be accessible.
IC4: The article should be peer-reviewed.
IC5: The article should be a journal, conference or workshop paper.
IC6: The focus of the article is on package recommendation as defined in the introduction, if this is retrievable from the title or abstract.
For the second category, every article should at least meet one of the criteria.
A method is proposed to solve a package recommendation problem.
An evaluation method is used to evaluate a package recommender system.
Evaluation metrics are used or compared for the evaluation of package recommender systems.
The exclusion criteria are used after the inclusion criteria are applied and have the function to exclude articles based on certain criteria. If one of these criteria is met, an article will be removed from the list.
The article is not about package recommendation as defined in the introduction.
An article is similar to another article and will be removed based on a quality assessment between the two articles.
The criteria were defined before the research took place to avoid bias. However, not all criteria could be foreseen, so IC3 was added during the research.
The guidelines of Kitchenham and Charters [9] were used for the data extraction, which was performed by the first and third authors. Both researchers had their own set of articles to review, but there was some overlap to check whether both researchers extracted data in a similar way. Disagreements or doubts about articles were discussed by the two researchers and resolved by consensus. Checks were done on the extracted data to ensure both researchers agreed on the conclusions drawn. The researchers used an Excel spreadsheet to extract data, which was prepared before the research. The spreadsheet contained several columns, and some columns were added during the research because of new insights. For instance, phases were identified within the package recommendation methods column, which resulted in a new column for each phase. Another example is constraints, which were added as a column. This resulted in the following data being extracted: review date, title, authors, year of publication, reference, database, package recommendation methods used per phase, constraints, domain, evaluation methods used, evaluation metrics used, research method, data set and size of the data set.
During the data extraction process the exclusion criteria were applied. There were also a few cases where several articles were written about one piece of research: over the years the research was expanded, but the core remained the same. In these cases, the data from the articles was extracted but processed as if it came from one article. Finally, some articles were excluded during the extraction process. A good example is P80 (Table 3), which was excluded during the data synthesis process, when analysis made clear that no useful data could be extracted from that article. The exclusion and combining of articles during the data extraction and synthesis stage resulted in a total of 79 articles that have been used for this literature review. Table 3 lists all 79 included articles plus the excluded article P80. The P-numbers are used throughout this review, and each P-number is followed by a number indicating the place of the article in the reference list.
Article references for articles selected in the systematic literature review
Package recommendation domains in systematic review articles
Domains
As shown in Table 4, PRS have been used in several domains, though substantially more in the travel domain than other domains. As discussed in [89], recommendation domains have different characteristics.
Package types in systematic review articles
In economics, a distinction is made between experience goods (which consumers learn about through experience) and search goods (for which direct experience is not needed) [90], and between sensory and non-sensory products [91]. Based on this, Tintarev and Masthoff [89] distinguish between items which are easy to evaluate objectively (such as light-bulbs and cameras) and those which require an experiential and subjective judgement (such as holidays and music). Almost all domains used for PRS are subjective ones.
In economics, cost is also seen as an important characteristic of a domain, with this not only including purchase price, but also the time and effort involved in the purchase/consumption and the psychological, physical, functional and social risk [92, 93, 91]. Tintarev and Masthoff [89] distinguish between high and low investment recommendation domains. Travel tends to be a high investment domain, so most papers are about package recommendation in high investment domains, though there is also work on lower investment domains (such as movies, books, clothes, and food). Additionally, a domain that is low investment when recommending a single item (such as movies and books) can become high investment when deciding on a set of movies to see or books to read, as consuming the package can involve a large time investment and paying for multiple items increases the cost. In a higher investment domain, a user may take longer to decide, so it is likely that users will take longer to decide on packages than on single items, which increases the users’ information needs and requirements for explanations in PRS.
Domain characteristics matter as they may impact the importance of different recommendation quality metrics. For example, a PRS can be incorrect in multiple ways: it can overestimate how much a user may like a particular package, or it can underestimate how much a user may like a package. The first can lead to recommendations the user may not end up liking. The second can lead to the user missing out on packages they may have liked. Research by [94] showed that in general users considered overestimation as less helpful than underestimation, but that overestimation was particularly problematic in high investment domains. They also found that users tended to be more forgiving about over- and underestimation in subjective domains. So, domain characteristics may impact the best method for combining individual items into packages.
Domain characteristics also influence the kind of constraints that are used in the package creation process. For example, in the travel domain, items are normally suggested which are geographically close to each other. Bundling “the Great Wall” in China with the “London Tower Bridge” in England would not be a great idea, but combining the “London Tower Bridge” with “Buckingham Palace”, “Big Ben” and “the British Museum” can be considered a good bundle since they are located in the same city. In the travel domain, the location characteristic prunes many items as package candidates.
Input types in systematic review articles
Table 5 shows the package types used in the systematic review articles. We distinguish between PRS that serve recommended packages as a sequence of items or as a complementary set of items. In a sequence of items, the items are presented in an ordered list. When the user consumes a recommended package, the user needs to follow the items in the order in which they are given in the list. For example, in the music domain a recommended playlist contains a sequence of songs. In the travel domain, a recommended sightseeing tour package contains a sequence of attractions. In contrast, in a complementary set of items, items are provided as an unordered list. For example, in the grocery domain, a PRS can bundle tea, coffee, and sugar in a package.
Combining the data in Tables 4 and 5, it is clear that in the systematic review articles sequences were only used in the travel and education domains. For some domains, sequences are not really an option: it is hard to envisage a sequence of furniture items, cosmetics, or electronics. For others, sequences are possible but rather far-fetched, for example, a sequence of clothes to wear over a week. For yet others, sequences are a realistic possibility, such as movies to watch during a film festival, or books to read over a longer period that are sequentially thematically linked, but the commercial market may be small. In the education domain, there are two kinds of package recommendations: one that recommends a learning path (learning materials to consume in a certain sequence to reach a learning goal) and one that recommends a set of learning materials that are complementary to each other to reach a set of goals. In the travel domain, the vast majority of work is on sequences: typically points of interest are combined into an itinerary. The exceptions are cases where, for example, a package contains a flight and a hotel. We note that, unexpectedly, music playlists are not included here, even though there has been research on playlist recommendation (e.g., [95, 96, 97]). In fact, music is absent as a domain for package recommendation in Table 4. This seems to be an artifact of the search terms used in the systematic review, with researchers in music recommendation having used the domain-specific term “play list recommendation” (instead of more generic terms such as package, bundle or cluster).
Package consumer
The PRS can recommend a package to different types of package consumers. In the review, we found two types: individual users and groups of users. When the PRS recommends a package to a user, the PRS needs to consider the user’s preferences. This situation is similar to that of traditional recommender systems, which need to gather and analyze a user’s preferences from different inputs, for example, intrinsic and extrinsic input. When the PRS recommends a package to a group of users, it not only needs to consider each member’s preferences but also needs to aggregate the members’ preferences to provide a solution that is best for the group. In addition, there are situations where PRS are able to recommend to both types of users (individuals and groups). The articles of the systematic review were almost all about package recommendation to individuals, with the exception of P18, P23, P30, P43, and P48, which dealt with package recommendation to groups. The latter were all in the travel domain (or unspecified in terms of domain), which is remarkable as there seems to be an opportunity for package recommendation to groups in other domains as well, such as food.
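The aggregation step for group recommendation can be illustrated with a minimal sketch. The two strategies shown (average and least misery) are common choices in group recommendation in general and are assumptions here, not necessarily the strategies used in P18, P23, P30, P43, or P48; the package names and scores are hypothetical.

```python
from statistics import mean

def aggregate_group_score(member_scores, strategy="average"):
    """Combine each member's predicted score for one package into a group score."""
    if not member_scores:
        raise ValueError("member_scores must not be empty")
    if strategy == "average":
        return mean(member_scores)
    if strategy == "least_misery":
        # The group is only as happy as its least happy member.
        return min(member_scores)
    raise ValueError(f"unknown strategy: {strategy}")

def best_package(predicted, strategy):
    """Return the package with the highest aggregated group score."""
    return max(predicted, key=lambda p: aggregate_group_score(predicted[p], strategy))

# Hypothetical predicted scores (1-5 scale) of three group members.
predicted = {
    "museum_tour": [5, 5, 1],
    "city_walk": [3, 4, 3],
}

choice_average = best_package(predicted, "average")
choice_least_misery = best_package(predicted, "least_misery")
```

With these scores the two strategies disagree: averaging favours the package two members love, whilst least misery avoids the package one member strongly dislikes.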
Recommendation input
As shown in Table 6, PRS differ in the input they use to produce recommendations. Just like traditional individual item recommender systems, PRS can use implicit and explicit input from users. Implicit input is provided unconsciously (for example, item clicks or time spent looking at an item), whilst explicit input is provided consciously (for example, in the form of ratings on a 1 to 5 scale). Contrary to traditional recommender systems which use input regarding individual items, PRS can also use input regarding item packages, such as package ratings or time spent looking at a package.
As seen in Table 6, both implicit and explicit input are used in PRS, with implicit input being used more often. Combining the data from Tables 4 and 6, this seems to be an effect of domain, with implicit feedback being used a lot in the travel domain (in 30 out of 47, i.e. 64% of cases). For example, a PRS in the travel domain can collect implicit input in the form of the previous journey of a user (containing a travel package or single destination) or the Global Positioning System (GPS) coordinates of their current position. In some cases, such input helps the PRS to discard irrelevant items. For example, in the travel domain when using GPS coordinates, the PRS discards attractions which are too far away from the user’s current position. PRS focused solely on the movies domain all used explicit input (typically ratings from the MovieLens dataset). There were only 6 PRS articles that combined both implicit and explicit input, so more research could be done on this combination. There was no clear preference for input type in PRS for education, with both implicit and explicit input being used.
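Discarding attractions that are too far from the user's GPS position can be sketched as a distance filter. The haversine great-circle formula used below is one standard way to compute such distances and is an assumption here; the surveyed articles may use other distance or travel-time measures. The coordinates are approximate and purely illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(attractions, user_pos, max_km):
    """Keep only the attractions within max_km of the user's position."""
    return [
        name
        for name, (lat, lon) in attractions.items()
        if haversine_km(user_pos[0], user_pos[1], lat, lon) <= max_km
    ]

# Approximate coordinates, for illustration only.
attractions = {
    "Tower Bridge": (51.5055, -0.0754),
    "British Museum": (51.5194, -0.1270),
    "Great Wall (Badaling)": (40.3540, 116.0200),
}
user_pos = (51.5074, -0.1278)  # a position in central London

candidates = nearby(attractions, user_pos, max_km=10)
```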
Another input type which is often used to recommend packages is the items’ features. For example, in the travel domain, an attraction can have many features, such as a description, attraction type (e.g. museum), opening hours, season, cost, and location. Whilst this input type is also occasionally used in traditional recommender systems as a basis for content-based filtering, in PRS it plays an additional role, in that it forms the basis for constraints (e.g. rules on which colours can be combined in clothing outfits or which locations are good to combine in a tourist route). As seen in Table 6, features are used in most PRS articles surveyed (79% of articles that specified input type). This is even more pronounced in the travel domain (37 out of 42, i.e. 88%, of articles solely on travel). Combining the data from Tables 5 and 6, features are used more in PRS recommending packages as sequences than as complementary sets (84% compared to 73%).
Explicit and implicit data can concern individual items or item packages. Table 7 shows which articles used input data only for individual items, and which also used data for packages. For implicit data, we assumed that travel (point of interest) check-in data provides package data (as one can see sequences in check-ins) and that items bought together also can be regarded as a package. Only some articles (27%) that used only explicit data used package data compared to most articles (83%) that used only implicit data.
Only individual vs also package input
Researchers have solved the PRS problem using different techniques. In general, they used phases to recommend packages. We have identified three phases and classified articles by the techniques they use in each phase:
1. The model learning phase, where the PRS learns several aspects required to produce recommendations, such as a user’s preferences.
2. The package creation phase, where the PRS mixes and matches items into packages and collects these packages into a package candidates list.
3. The package selection phase, where the packages to be recommended are selected from the package candidates list. In this phase, the researcher evaluates the package candidates using techniques such as top-N.
Even though we classified the PRS articles based on the three phases, some articles only used one or two. For example, some researchers used pre-defined packages or random coordination for the second phase, and focused on understanding the user’s preferences and also selecting the best packages to recommend to a user. Tables 8–10 show which techniques were used for each phase in the survey articles.
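The three phases can be pictured as a small pipeline. The sketch below is deliberately simplistic and every function in it is hypothetical: model learning is reduced to a mean rating per item, package creation to enumerating unordered item pairs, and package selection to ranking by summed item score. It illustrates how the phases connect, not how any surveyed system implements them.

```python
from itertools import combinations

def learn_model(ratings):
    """Phase 1 (model learning): here simply the mean observed rating per item."""
    return {item: sum(values) / len(values) for item, values in ratings.items()}

def create_packages(items, size):
    """Phase 2 (package creation): enumerate all unordered item combinations."""
    return list(combinations(sorted(items), size))

def select_packages(packages, model, n):
    """Phase 3 (package selection): rank by summed item score and keep the top N."""
    ranked = sorted(packages, key=lambda p: sum(model[i] for i in p), reverse=True)
    return ranked[:n]

# Hypothetical ratings for grocery items.
ratings = {"tea": [4, 5], "coffee": [3, 3], "sugar": [2, 4], "milk": [5, 5]}

model = learn_model(ratings)
candidates = create_packages(ratings.keys(), size=2)
recommended = select_packages(candidates, model, n=1)
```

A system that uses predefined packages would replace `create_packages` with a lookup, which corresponds to the articles that skip or simplify the second phase.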
In the model learning phase, the PRS uses the input as described in Section 3.4 to obtain knowledge about the user’s preferences and also item characteristics. As shown in Table 8, several techniques are used, in particular clustering, collaborative filtering (CF), user preference modelling, item relationship modelling, and topic modelling.
PRS Phase 1: Model learning
The algorithm used is also influenced by the PRS input. PRS which only use explicit input, such as ratings, tend to use CF methods such as matrix factorization (MF), item-based CF, and memory-based CF. For example, Wibowo et al. [11, 12] used user-item-rating and user-package-rating matrices of clothes as input and used MF to obtain users’ and items’ latent factors. Combined with an aggregation function (such as minimum, maximum, or harmonic mean), they then used these latent factors to approximate the package rating.
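The idea of approximating a package rating from latent factors plus an aggregation function can be sketched as follows. This is an illustration under stated assumptions, not the method of [11, 12]: item ratings are predicted as a plain dot product of already-learned latent factors (the factor values are made up), and the three aggregation functions named above are applied to the predicted item ratings.

```python
from statistics import harmonic_mean

def predict_item_rating(user_factors, item_factors):
    """Predicted rating as the dot product of user and item latent factors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# Aggregation functions mentioned in the text.
# Note: statistics.harmonic_mean does not accept negative values.
AGGREGATORS = {"minimum": min, "maximum": max, "harmonic_mean": harmonic_mean}

def predict_package_rating(user_factors, package_item_factors, aggregator):
    """Aggregate the predicted ratings of a package's items into one package rating."""
    item_ratings = [predict_item_rating(user_factors, f) for f in package_item_factors]
    return AGGREGATORS[aggregator](item_ratings)

# Hypothetical latent factors (two dimensions) for one user and two clothing items.
user = [1.0, 2.0]
shirt = [2.0, 1.0]     # predicted item rating: 4.0
trousers = [0.0, 1.0]  # predicted item rating: 2.0

package = [shirt, trousers]
rating_min = predict_package_rating(user, package, "minimum")
rating_max = predict_package_rating(user, package, "maximum")
rating_hm = predict_package_rating(user, package, "harmonic_mean")
```

The choice of aggregator encodes an assumption about packages: the minimum treats a package as only as good as its weakest item, whereas the harmonic mean penalizes weak items less severely.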
PRS that used unstructured data on items or packages, such as text, often used topic modelling techniques such as Latent Dirichlet Allocation (LDA). LDA parses the text descriptions and automatically extracts specific topics which relate to items or packages. For example, Xiong et al. [83] used travel website information as input and used LDA-based topic analysis to automatically extract the topics. They then matched the extracted topics with the users’ interests. Zhang et al. [62] used LDA to classify the POI categories from content descriptions of each POI. They then determined whether a proposed route (containing a sequence of POIs) is feasible in the sense of containing at least a certain number of POI categories.
Some PRS used item relationship modelling (such as Markov chain, probability model, Apriori and ontology) to model the relation between items. For example, Yu et al. [38] used a Markov chain-based approach to model the relations of particular products with regard to the users’ sequential behaviour. Mikhailov et al. [85] used an ontology to model, among other things, the similarity between attractions. With the available survey data, no clear relationship between the input type and item relationship modelling can be found.
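As an illustration of Markov chain-based item relationship modelling, the sketch below estimates first-order transition probabilities between items from users' behaviour sequences. It is a generic first-order chain with hypothetical session data, not the specific model of Yu et al. [38].

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next item | current item) from observed item sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each consecutive (current, next) pair in the sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

# Hypothetical browsing or purchase sessions.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
]

probs = transition_probabilities(sessions)
```

Items with a high transition probability from an item the user has just consumed are natural candidates for the next position in a sequential package.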
Some PRS used domain-specific methods. For example, some travel PRS used check-in data to deduce user preferences: the more often a user visits a POI, the more that user is assumed to prefer it. The categories of the most visited POIs are then determined to calculate the user’s preferences for the categories in the system.
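A minimal sketch of this check-in based approach is given below, assuming that a user's preference for a category is simply the share of their check-ins that fall into it. The normalization and the POI data are assumptions made for illustration; individual articles may weight or threshold visits differently.

```python
from collections import Counter

def category_preferences(checkins, poi_category):
    """Turn one user's check-ins (a list of POI ids) into category preference shares."""
    visits = Counter(poi_category[poi] for poi in checkins)
    total = sum(visits.values())
    return {category: n / total for category, n in visits.items()}

# Hypothetical POI-to-category mapping and check-in history of one user.
poi_category = {"louvre": "museum", "orsay": "museum", "eiffel": "landmark"}
checkins = ["louvre", "louvre", "orsay", "eiffel"]

prefs = category_preferences(checkins, poi_category)
```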
In some cases clustering was used to learn the user model. Clustering is used to find similar groups of items or similar groups of users. For instance, in P4 clusters of items are made to determine the preferences of a user for a certain group of items. Clustering users is used to find users with similar preferences or qualities. For instance in P13, students with similar results were clustered to find what courses fit best with a certain group of students. Clustering is not only applied for individual recommendations, but also for group recommendations. For example, in P30 user clustering is used to determine the preferences of a group of users for a group PRS. Except for P30 all other articles that used clustering made use of implicit input data.
Several PRS did not use model learning, for example, because their package construction did not require user preferences, but only used known item features and constraints. For example, in P37 the goal is to recommend a team consisting of complementary team members. This is based on members’ skills and social fit in a team, so model learning of preferences is not needed. Model learning is also not needed when users explicitly enter their preferences (e.g. as a search query) and only that information is used to make recommendations, so no history or data from other users is used.
As mentioned above, the package creation phase is used to generate a package candidates’ list. In this process, items are combined with other items into a package.
As can be seen in Table 9, many researchers have regarded this as a knapsack problem, where the solution is a combinatorial optimization in which a collection of items is selected which maximizes the value or minimizes the cost, whilst remaining within certain constraints. For example in the travel domain, the knapsack problem can be defined as how to include as many POIs as possible in an itinerary, whilst remaining within the user’s budget (money and/or time). Several knapsack algorithms were implemented, using for example dynamic programming, search algorithms (e.g., greedy search), and evolutionary algorithms (such as Ant Colony Optimization [71]). (Ant Colony Optimization iteratively randomized the items involved in a package and evaluated its estimated value, such as travel distance, whilst improving the solutions randomly over several iterations.)
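The knapsack view of package creation can be made concrete with a small dynamic programming sketch. It solves the classic 0/1 knapsack problem for a single integer budget (here visiting time in minutes); the POI names, durations, and interest values are hypothetical, and real itinerary planners typically add travel time between POIs and other constraints.

```python
def knapsack(pois, budget):
    """0/1 knapsack by dynamic programming.

    pois: list of (name, cost, value) with non-negative integer costs.
    budget: non-negative integer budget.
    Returns (best_total_value, list_of_chosen_names).
    """
    # best[b] holds the best (value, names) achievable with total cost <= b.
    best = [(0, [])] * (budget + 1)
    for name, cost, value in pois:
        # Iterate budgets downwards so that each POI is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate_value = best[b - cost][0] + value
            if candidate_value > best[b][0]:
                best[b] = (candidate_value, best[b - cost][1] + [name])
    return best[budget]

# Hypothetical POIs: (name, visiting time in minutes, estimated interest).
pois = [
    ("museum", 120, 8),
    ("tower", 60, 5),
    ("park", 90, 6),
    ("market", 45, 3),
]

total_value, itinerary = knapsack(pois, budget=240)
```

Setting every value to 1 turns the objective into "as many POIs as possible", the formulation mentioned above.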
PRS Phase 2: Package creation
Some researchers have regarded package creation as a clustering problem, where the package combination is obtained from a common value in a group of data. For example, in the travel domain, POIs can be clustered based on their attributes (such as geographical location and POI type). In the survey articles, several clustering methods have been applied, but two methods are used the most: nearest neighbours and fuzzy clustering. Nearest neighbours (k-NN) is a clustering method which finds the k most similar items based on a target item. Sometimes k-NN is used by other methods. For instance, the papers which make use of the BOBO algorithm use k-NN to create packages around BOBO’s pivots (target items). The other commonly used method, fuzzy clustering, is very similar to k-means. Just like in k-means, k clusters of data points are created with fuzzy clustering. However, whereas in k-means a data point can only belong to one cluster, in fuzzy clustering a data point can belong to multiple clusters. In the survey articles, clustering is used equally often with implicit and explicit data. In 75% of these articles the input data is individual item data, and likewise in 75% of the articles clustering is used to produce complementary packages (rather than sequences). So, there has been most focus on clustering when producing complementary packages based on individual item data.
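The difference between hard and fuzzy cluster membership can be shown with a short sketch. It assumes the fuzzy c-means membership formula with fuzzifier m, which is one widely used fuzzy clustering method; the surveyed articles may use other variants. Only the membership computation for fixed cluster centres is shown, not the full iterative algorithm, and the points are made up.

```python
import math

def hard_assignment(point, centres):
    """k-means style assignment: index of the single nearest centre."""
    dists = [math.dist(point, c) for c in centres]
    return dists.index(min(dists))

def fuzzy_memberships(point, centres, m=2.0):
    """Fuzzy c-means style memberships: one degree in [0, 1] per centre, summing to 1."""
    dists = [math.dist(point, c) for c in centres]
    if 0.0 in dists:
        # A point lying exactly on a centre belongs fully to that centre (or centres).
        zeros = [d == 0.0 for d in dists]
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Hypothetical 2-D data point and two cluster centres.
centres = [(0.0, 0.0), (3.0, 0.0)]
point = (1.0, 0.0)

nearest = hard_assignment(point, centres)
memberships = fuzzy_memberships(point, centres)
```

Here the point is assigned only to the first cluster under hard assignment, but under fuzzy membership it belongs mostly to the first cluster and partly to the second, so it could contribute to packages built from either.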
When a user likes two items individually, this does not necessarily mean that the user would like the combination of the two. For example, somebody may like a red pair of trousers and an orange shirt, but may not want to wear them together. Therefore, when combining items into packages, most papers used constraints. For example, in the clothing domain, constraints have specified which colours, patterns, and formality to combine. Constraints can be manually constructed or learned. In the travel domain, a cost function such as travel time and distance is often used as a basis of constraints (e.g., not to select two items that are too far from each other).
Some papers used predefined packages, or assigned items randomly to packages to create a package list. Others created all possible packages, based on a package model, which specifies the frequency of item types in a package or other constraints. For example, in a clothes PRS, if a package can contain a shirt and a pair of trousers, they would generate all possible combinations of a shirt and pair of trousers. Similarly, in an educational PRS, all possible course sequences can be generated, taking course pre-requisites, maximum number of courses to take in each term, and which course runs in which semester into account (P13).
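Generating all packages that fit a package model and then filtering them with a constraint can be sketched as follows. The package model (one shirt plus one pair of trousers) follows the example in the text; the colour-clash rule and the items are hypothetical stand-ins for manually constructed or learned constraints.

```python
from itertools import product

# Hypothetical constraint: pairs of colours that should not be worn together.
CLASHING_COLOURS = {frozenset({"red", "orange"})}

def colours_compatible(shirt, trousers):
    """Return True unless the two items' colours are listed as clashing."""
    return frozenset({shirt["colour"], trousers["colour"]}) not in CLASHING_COLOURS

def generate_outfits(shirts, trousers, is_compatible):
    """Enumerate every shirt/trousers combination that satisfies the constraint."""
    return [(s["name"], t["name"]) for s, t in product(shirts, trousers) if is_compatible(s, t)]

shirts = [
    {"name": "orange shirt", "colour": "orange"},
    {"name": "white shirt", "colour": "white"},
]
trousers = [
    {"name": "red trousers", "colour": "red"},
    {"name": "blue jeans", "colour": "blue"},
]

outfits = generate_outfits(shirts, trousers, colours_compatible)
```

Exhaustive enumeration grows multiplicatively with the number of items per type, which is one reason many articles treat package creation as an optimization problem instead.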
PRS Phase 3: Package selection
Package selection is the last phase of package recommendation. In this phase, the PRS selects a number of packages from a package candidates’ list obtained in the previous phase. As can be seen in Table 10, the most common approach for recommending packages is top-N, whereby the N best packages (as estimated by the PRS) are recommended to the user in a ranked list. Top-1, a sub case of Top-N, where only the best package is recommended is also very popular, and has been used more often than in individual item recommendation. In an individual item recommender, a popular alternative to Top-N is to show all items (not ordered) with a rating system (e.g. stars) to indicate the recommender’s estimated user preferences for each item. This is not really used in PRS, simply because the number of packages tends to be far too large. Some PRS show multiple packages that have been deemed suitable without providing a ranking. It is likely that the number of packages shown to the user depends also on the complexity of packages. For example, when a package contains many items, it is likely that the number of packages shown is smaller. We did not find a domain effect yet: for example, the percentage of travel PRS that used Top-1 is 34% and very similar to the 32% of all PRS that used Top-1. (Whilst many travel PRS produce more complicated packages containing multiple points of interest, some travel PRS just combine flights and hotels, so the domain categories we used are not necessarily a reflection of package complexity.)
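Top-N selection itself is straightforward once every candidate package has an estimated score. The sketch below uses Python's `heapq.nlargest`, which avoids sorting the full candidate list; the package names and scores are hypothetical.

```python
import heapq

def top_n(scored_packages, n):
    """Return the n highest-scoring (package, score) pairs, best first."""
    return heapq.nlargest(n, scored_packages, key=lambda pair: pair[1])

# Hypothetical candidate packages with scores estimated by the PRS.
scored = [("package A", 0.4), ("package B", 0.9), ("package C", 0.7)]

top_two = top_n(scored, 2)
top_one = top_n(scored, 1)  # Top-1 is the special case n = 1
```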
Table 11 shows how PRS have been evaluated. (Some articles used more than one form of evaluation, so the number of evaluation forms can be higher than the total number of articles.)
The vast majority of articles surveyed did not evaluate the PRS through a user study, but instead used an off-line computational evaluation method. Most (64%) papers that used a computational evaluation method measured accuracy. This is in line with many studies of traditional recommender systems, where the emphasis has been on prediction accuracy and top-N recommendation accuracy.
Evaluation methods and metrics in systematic review articles
Accuracy. Prediction accuracy normally measures the extent to which the predicted item ratings correspond with the actual item ratings (where actual item ratings are not necessarily given explicitly, but can be inferred through implicit input).
Prediction accuracy treats errors in predictions for good and bad items equally, whilst recommender systems tend to only show a limited number of items to a user with good rating predictions. Therefore, recommender systems are also often evaluated on the relevance of the recommendations in a ranking situation using a top-N recommendation task [99]. (Metrics such as Fagin’s intersection metric are in between these two categories: they consider rankings and can be applied to top N.)
Additionally, some accuracy measures of relevance take the ranking positions into account such as normalized Discounted Cumulative Gain (nDCG), Average Precision (AP), Mean Average Precision (MAP), and Weighted Average Precision (WAP). Table 11 shows that accuracy measures based on relevance are most popular in PRS evaluations.
Many papers used a combination of different accuracy metrics. For example, papers that measured the accuracy of predicted item rankings, often also measured the relevance of the recommendations using nDCG. Similarly, papers that measured the relevance of recommendations using nDCG also reported Precision@N (with the exception of P21).
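To make the relevance-based metrics mentioned above concrete, the following is a minimal sketch of Precision@N and nDCG for a single ranked list of recommended packages. The function names, the binary relevance labels, and the log2 rank discount are assumptions made for illustration; the surveyed papers may use other gain or discount definitions, and here the ideal ordering is derived only from the items in the list itself.

```python
import math

def precision_at_n(relevance, n):
    """Fraction of the top-n recommendations that are relevant (labels are 0 or 1)."""
    return sum(relevance[:n]) / n

def dcg_at_n(relevance, n):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:n], start=1))

def ndcg_at_n(relevance, n):
    """DCG normalised by the DCG of the same labels sorted in the ideal order."""
    ideal = dcg_at_n(sorted(relevance, reverse=True), n)
    return dcg_at_n(relevance, n) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: 1 = the user found the package relevant, 0 = not.
ranked_relevance = [1, 0, 1, 0, 0]
print(precision_at_n(ranked_relevance, 5))
print(ndcg_at_n(ranked_relevance, 5))
```

Precision@N ignores where in the top N the relevant packages appear, whereas nDCG rewards placing them nearer the top, which is one reason the two are often reported together.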
Measuring accuracy requires a gold standard: actual ratings (footnote 6).
Footnote 6: Explicitly or implicitly acquired.
More recently, researchers have argued against evaluating recommender systems on accuracy alone, and have advocated the additional use of measures such as coverage, confidence, diversity, novelty, serendipity, utility, and scalability [100].
Scalability is the extent to which the PRS can deal with larger data sets (in terms of processing power required and speed). In the papers reviewed, after accuracy, scalability was evaluated the most, with scalability being measured in 42% of the papers which had computational evaluations. A high proportion (75%) of papers that evaluated scalability did not evaluate accuracy.
Utility is the extent to which a recommendation is useful, that is, its value. Two kinds of value can be distinguished: value to the package consumer and value to the package provider. In the surveyed articles, the ways to calculate package value were often domain specific. For example, a book PRS used revenue gain to measure value for the package provider. A travel PRS used travel time as one way to measure value for the package consumer (with a higher travel time meaning lower value). An educational PRS used grade point average and graduation time, which can measure value for both package provider and consumer. Utility was measured in 27% of the papers which had computational evaluations. In about half of these papers, it replaced the accuracy measurement.
Coverage is the percentage of users for which the system can provide recommendations and/or the proportion of items that can be recommended. Standard measures include the Gini index and Shannon entropy. Sometimes the performance of the system is also measured specifically for new (‘cold’) items or users (those with fewer than a certain number of ratings). Coverage was measured in only 12% of the papers which had computational evaluations.
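As an illustration of the two coverage measures named above, the sketch below computes Shannon entropy and the Gini index over how often each item appears in the recommendations. The input format (a list of per-item recommendation counts) and the function names are assumptions for illustration, not a formulation taken from the surveyed papers.

```python
import math

def shannon_entropy(counts):
    """Entropy (in bits) of the distribution of recommendations over items.

    Higher values mean recommendations are spread more evenly over the catalogue.
    Assumes at least one recommendation was made.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def gini_index(counts):
    """Gini index: 0 when all items are recommended equally often,
    approaching 1 when recommendations concentrate on a single item.
    Assumes at least one recommendation was made.
    """
    n = len(counts)
    total = sum(counts)
    ascending = sorted(counts)
    weighted = sum((2 * i - n - 1) * c for i, c in enumerate(ascending, start=1))
    return weighted / (n * total)

# Hypothetical counts of how often each of four items was recommended.
even = [5, 5, 5, 5]
skewed = [0, 0, 0, 20]
print(shannon_entropy(even), gini_index(even))
print(shannon_entropy(skewed), gini_index(skewed))
```

Both measures describe how recommendations are distributed over the catalogue rather than whether any individual recommendation is correct, so they complement accuracy rather than replace it.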
Diversity is traditionally the extent to which recommendations are dissimilar from each other. In PRS, two types of diversity are used, intra-package diversity and inter-package diversity. Intra-package diversity is the diversity of items within a package. Inter-package diversity is the diversity between recommended packages. Out of 8 articles, only P31 used inter-package diversity. Intra-package diversity is calculated as 1 minus the average similarity between any two items in the package, which is typically calculated based on the item features. Diversity was measured in only 12% of the papers which had computational evaluations; all but one of these used features.
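The intra-package diversity calculation described above (1 minus the average pairwise similarity of the items in a package) can be sketched as follows. The use of cosine similarity over numeric feature vectors, the function names, and the handling of single-item packages are assumptions for illustration; the surveyed papers may use other similarity functions.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def intra_package_diversity(package_features):
    """1 minus the average similarity over all unordered pairs of items."""
    pairs = list(combinations(package_features, 2))
    if not pairs:
        # Assumption: a package with fewer than two items has zero diversity.
        return 0.0
    average_similarity = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - average_similarity

# Hypothetical package of three items described by two binary features.
package = [[1, 0], [1, 0], [0, 1]]
print(intra_package_diversity(package))
```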
Cohesion is the extent to which items within a package belong together in terms of similarity. This is typically the opposite of intra-package diversity. Only 3 articles explicitly considered cohesion. The trade-off between cohesion and diversity seems domain dependent. In some domains (e.g. team recommendation) cohesion may well be more important, whilst in other domains diversity may be more important.
Confidence is the system’s trust in its own predictions. We did not find any papers explicitly mentioning confidence, but did find two papers using the related concept Perplexity, which is a measure of uncertainty.
Novelty is the extent to which users were unaware of recommended items (footnote 7).
Footnote 7: Another measurement is serendipity: the extent to which successful recommendations are surprising to the user (e.g. a recommendation of a new book by their favourite author may be novel, but not surprising). Serendipity was not measured for PRS at all, neither in the computational evaluations nor in the user studies.
Footnote 8: Splitting the data in the earlier time period into a training and test set as usual, whilst adding the later time period to the test set.
In principle, it is also possible to evaluate aspects such as accuracy, coverage, diversity and scalability during a user study, in which study participants interact with a PRS. However, most papers surveyed used these metrics in an off-line setting, mostly using an existing data set as a basis for both the system and the evaluation (mostly using n-fold cross-validation, whereby a part of the data is used to inform the system and a part to evaluate it, and this is repeated n times) (footnote 9).
Footnote 9: In some cases (e.g. P18), a user study is done to create a new data set, but this user study is not used to directly measure the performance of the system, but rather to construct a data set that is used again in an off-line setting.
Footnote 10: In some cases (e.g. P25), one data set is used to produce the PRS and another to evaluate the PRS.
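The n-fold cross-validation procedure mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the interleaved assignment of records to folds are hypothetical, and in practice the data would usually be shuffled before splitting.

```python
def n_fold_splits(ratings, n):
    """Yield n (train, test) pairs; every rating lands in exactly one test fold."""
    # Assign records to folds in an interleaved fashion: 0, 1, ..., n-1, 0, 1, ...
    folds = [ratings[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Hypothetical data set of ten rating records, identified here by integers.
ratings = list(range(10))
for train, test in n_fold_splits(ratings, 5):
    print(len(train), len(test))
```

Each iteration uses the training part to inform the system and the held-out part to evaluate it, and the reported metric is typically the average over the n iterations.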
Only 23% of articles contained a user study: 13 articles combined a computational evaluation with a user study, and 5 contained only a user study. As shown in Table 11, user studies mainly focused on user satisfaction with the package recommendations (including perceived usefulness), though there was also some work on accuracy and performance. Only 3 papers contained an expert evaluation; one of these combined an expert evaluation with a user study. The expert evaluations all focused on satisfaction.
Satisfaction is how pleased users (or experts) are with the PRS and its recommendations. Satisfaction was normally measured through surveys asking for participants’ opinions about recommendations (perceived usefulness, and so related to the utility measurements in the computational evaluations). Sometimes the usability of the PRS was measured (e.g. P51, P66, P76). One paper (P78) considered user retention, that is, how long users would keep using the PRS. Another measure that has been mentioned for recommender systems is the users’ trust in the system [100]. This can be measured by considering how many recommendations are followed or by asking users whether they find the recommendations reasonable. It is often hard to measure trust independently from user satisfaction, and as can be seen, the measurements taken in PRS for satisfaction are implicitly measuring trust as well.
Performance. To measure performance, the throughput or execution time was used. For instance, throughput was measured by counting the number of completed tasks per minute (P36, P78). Execution time is quite similar, but is focused on one task and how long it takes for that specific task to be selected and completed (P36).
Accuracy measurements in the user studies were all related to the relevance of recommendations, and used either the mean reciprocal rank (MRR; a variant of MAP that was used in the computational evaluations), or the rate at which users accepted recommendations, clicked on them, or consumed them (so, similar to the hit-rate in the computational evaluations, but now based on the data of users who had actually used the PRS).
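The two kinds of user-study accuracy measurement described above can be sketched as follows. The input formats and function names are assumptions for illustration: each user contributes the 1-based rank of the first recommendation they accepted (or None if they accepted none), and a flag indicating whether they accepted any recommendation.

```python
def mean_reciprocal_rank(first_accepted_ranks):
    """Average of 1/rank of the first accepted recommendation per user.

    A user who accepted nothing (rank None) contributes 0.
    """
    reciprocal = [1.0 / rank if rank is not None else 0.0
                  for rank in first_accepted_ranks]
    return sum(reciprocal) / len(reciprocal)

def acceptance_rate(accepted_flags):
    """Share of users who accepted (or clicked on, or consumed) a recommendation."""
    return sum(accepted_flags) / len(accepted_flags)

# Hypothetical study with four participants.
ranks = [1, 2, None, 4]
accepted = [True, True, False, True]
print(mean_reciprocal_rank(ranks))
print(acceptance_rate(accepted))
```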
Conclusions
This paper presented a systematic review of PRS which, after applying inclusion and exclusion criteria, looked at 79 articles in detail. This area of recommender systems’ research is still relatively immature. We note the following challenges for PRS which require future work.
Need for more open data sets that contain package ratings
Many recommender systems’ data sets have been released, but so far most are for the recommendation of individual items. Some researchers have creatively constructed package data sets from available data sets using several assumptions. For example, P39 created travel packages for users by using individual POI ratings. By combining these POI ratings and taking the POI popularity and intra-package diversity into account, a score for each package was calculated. The problem is that combining several items a user likes does not mean that the user also appreciates the combination of those items. Combining two items of clothing that somebody likes does not have to mean that they would like the combination of those items as an outfit.
A few researchers have produced their own package data sets through explicit user ratings in studies, but these data sets are still very limited in size and only available in some domains. For example, Wibowo et al. [11, 12] collected package data in the clothes domain by asking participants to rate a “top” (such as a shirt) and a “bottom” (such as pants) individually and as a package. As another example, Sharma et al. (P25 [34]) collected ratings for sets of movies from active users of MovieLens, using movies each user had rated individually in the past. Researchers have often resorted to the use of implicit package data. Whilst this gives some insight into what users tend to consume together, and may therefore like, it is not necessarily indicative of the best possible combinations (as users may just not have been aware of other options), and it often makes it hard to distinguish between packages which the user dislikes and packages which the user has just never noticed. Package data sets are not just important as a basis for recommendations, but are vital to get a reliable measurement of accuracy. Using the average of individual item ratings to produce a gold standard for the package rating is in many cases not right. For example, if a user adores a red pair of trousers and adores an orange shirt, this does not necessarily make this a great outfit. Similarly, a user may really like the British Museum and the Victoria and Albert Museum, but may be unlikely to combine them in a one-day outing. Whilst other measures (such as cohesion and diversity) may contribute to the estimation of the goodness of a package recommendation, without actual user package ratings (or other ways of gathering user opinions on packages), it is hard to reliably measure accuracy, or even to investigate the impact of diversity and cohesion on (perceived) accuracy.
Therefore, for the field of PRS to progress, the creation of large open data sets that contain both individual and package ratings is crucial. Given the reliance of most PRS on features for the construction of packages, such data sets also need to contain item features.
Need for more sophisticated ways of dealing with data sparsity and package cold starts
Package rating matrices are even sparser than individual item rating matrices. There has been some initial research on ways to reduce matrix sparsity [101]. Additionally, whilst individual item recommender systems often suffer from user cold start problems (the difficulty of recommending to new users who have not rated anything yet) and item cold start problems (the difficulty of recommending new items which have not been rated by anybody yet), PRS have a package cold start problem: the difficulty of recommending a new package. Each new item that is added could potentially lead to a very large number of new packages that contain that item. More research is needed on how to deal with data sparsity in the package rating matrix and how to deal with the package cold start problem. Given that it is unlikely that this problem can be fully solved for large-scale real-world systems, there is a need for research on accurate estimations of package ratings based on individual item ratings, (sparse) ratings of other packages, and item features. This research will need to be done in multiple domains, as this is certainly to a large extent domain dependent (footnote 11).
Footnote 11: Domain types may be distinguished, which may share certain parts of this accuracy estimation, for example the importance of package cohesion.
Package recommendation is computationally complex, with many approaches to model learning (e.g. of package preferences) and package creation being NP-hard. It is therefore important to produce more efficient algorithms, which make optimal use of heuristics (such as constraints) that are based on evidence-based insights into which item combinations go well together in a certain recommendation domain.
Need for more sophisticated metrics
Most evaluations used the same metrics as used for individual item recommender systems. There is a need for metrics specifically developed for PRS. For example, most papers used traditional accuracy metrics, whilst in a PRS, particularly when package size increases, users may decide to consume only part of a package. This means that it is no longer solely a question of comparing packages consumed (as whole entities) to packages recommended, but that the content (the items in the set) of what is actually consumed needs to be compared against what has been recommended, and that, if the package is a sequence, the order of consumption of individual items also needs to be considered. In such cases, traditional accuracy metrics (such as Precision@N, which was most popular in the papers surveyed) no longer suffice. Only a few papers used Jaccard similarity and longest common sub-sequence to perform such a comparison that takes package content into account. Even those metrics will need more work, and will need adjusting to fully capture the complexity of PRS evaluation. For example, where in the sequence the longest common sub-sequence is (for example, at the start or the end) may matter for users’ perceptions of whether the recommendation was followed and useful. If the longest common sub-sequence in a travel package was at the start, perhaps a user really liked the package, but ran out of time (or got lost). Similarly, more work is needed on diversity and cohesion metrics. Additionally, user studies could benefit from a reliable scale to measure appreciation of packages, so that users’ opinions on package details (such as the start, finish, cohesion, diversity, serendipity) can be measured. Existing questionnaires are mainly focused on individual item recommendations, and to the best of our knowledge there is no validated scale for PRS.
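The two content-aware comparisons mentioned above can be sketched as follows: Jaccard similarity compares the sets of recommended and consumed items regardless of order, whilst the longest common sub-sequence also respects the order of consumption. The function names and the example travel package are hypothetical, and this sketch does not capture where in the sequence the overlap occurs.

```python
def jaccard_similarity(recommended, consumed):
    """Size of the intersection of the two item sets divided by the size of their union."""
    a, b = set(recommended), set(consumed)
    if not a and not b:
        return 1.0  # Assumption: two empty packages are treated as identical.
    return len(a & b) / len(a | b)

def lcs_length(recommended, consumed):
    """Length of the longest common sub-sequence (order kept, gaps allowed)."""
    previous = [0] * (len(consumed) + 1)
    for rec_item in recommended:
        current = [0]
        for j, con_item in enumerate(consumed, start=1):
            if rec_item == con_item:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

# Hypothetical travel package: recommended route versus what the user visited.
recommended = ["museum", "park", "cafe", "tower"]
consumed = ["museum", "cafe", "bridge"]
print(jaccard_similarity(recommended, consumed))
print(lcs_length(recommended, consumed))
```

A user who visits all recommended items in reverse order would score a Jaccard similarity of 1 but a longest common sub-sequence of length 1, which illustrates why sequential packages need order-aware metrics.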
Need for more comprehensive evaluations and user studies
Whilst most papers contained an evaluation, these were predominantly computational evaluations and evaluations of accuracy. Clearly, there is more to the goodness of a recommendation than accuracy, just as has been argued for single item recommender systems as discussed above. Consuming a package requires more investment by the user (in time and/or money) than consuming an individual item. Recommending packages is also more complicated than recommending individual items. Whilst most of the current studies use computational evaluations, this is not enough to understand this complex problem. Users need to be more involved in evaluations through user studies. This could help to better understand how package recommendation works and what a good package is.
Need for more domains to be studied
The focus of package recommendation has so far been mainly on the travel domain (with each other domain only studied in a couple of papers). However, there are many other domains that could be studied more broadly. This could result in a deeper and broader understanding of the package recommendation field.
