Package recommender systems: A systematic review

Abstract

This paper presents a systematic review of research into package recommender systems (PRS). PRS recommend a combination of items rather than individual items, for example, multiple attractions to visit on a day trip to a city, trousers and a shirt to wear together, or multiple dishes to make up a meal. The review provides a framework for considering existing PRS research, and highlights the techniques used for package recommendation and the evaluation methods and metrics employed. It also raises many issues that warrant future research.

Keywords

Bundle recommendation, package recommendation, constraints, recommender system, Bandung

1. Introduction

Recommender systems, which recommend products or services for users to consume, have been around since the early 1990s. The goal of these systems is to overcome the problem of information overload. Different techniques have been used, such as collaborative filtering [1], content-based filtering [2] and hybrid filtering [3], which combines the former techniques. Techniques have become increasingly sophisticated, and the application domains have been extended. Currently, recommendations are no longer only for single items, but also for sets of items, for instance, to plan a tourist route [4], to decide what courses to follow [5], or to choose dishes that fit together as a meal.

Package recommendation (also known as bundle recommendation) is defined as the suggestion of a group of items that fit well together [6]. Despite this general definition, the application of package recommendation can take many forms. For instance, in travel recommendations, points of interest (POI) can be combined to propose an ideal route for a user [4, 7, 8]. This type of recommendation can be seen as a sequential package: the recommended items have to be consumed in a certain order, because the order affects the appropriateness of the package. Another example is the recommendation of clothes (e.g. the combination of a skirt and a shirt). In this case, the sequence of the items is of no importance, and the package is called a complementary package.

In the past years, several systematic literature reviews have been conducted in the recommender systems domain. However, to the best of our knowledge, no systematic literature review has focused specifically on package recommender systems (PRS), which are the focus of this paper. We believe this is the right time for such a survey, given the increasing interest in PRS (see Fig. 1). The goal of this survey is to establish the state of the art in this field, focusing on domains, techniques, and evaluation methods and metrics used.

Figure 1.

Number of PRS articles through the years in the survey.

In the following section, the research method for this systematic literature review will be explained. It contains the research questions, search strategy, criteria and procedures to find, analyze and synthesize the data. Section 3 discusses the results of the review. The domains, package types, recommendation input types, PRS phases and techniques, and evaluation methods and metrics are explained. The last section contains the conclusions of this research and directions for future work.

2. Data collection

2.1 Research questions

The main goal of this research is to find the techniques used for package recommendations and how the performance of each technique is evaluated. To investigate this, we used the following research questions:

1. In what domains are package recommender systems applied?
2. What are the different techniques for package recommendation and how do they work?
3. What evaluation methods are used to evaluate the performance of the package recommendation methods?
4. What evaluation metrics are used when evaluating the performance of package recommendation methods?

2.2 Search strategy

To find all the relevant information about package recommendation, we defined a search strategy. First, relevant databases were selected to support a thorough search. The selected databases are listed in Table 1. The second step was to create a search string with which to query the databases.

Table 1
Search sources

Databases	ACM
	IEEE Xplore
	ScienceDirect
	Scopus
	Google Scholar
Searched items	Journal, conference and workshop papers
Search applied on	Article title, abstracts and keywords
Language	English
Search period	1999–2018

To create a search string, we started with basic terms such as “package recommendation” as keywords. Through informal searches, we found more terms such as “bundle recommendation” and “clustering recommendation”. We then used synonyms for these keywords to come up with the final search string:

“package recommendation” OR “package recommender systems” OR “package recommender system” OR “bundle recommendation” OR “recommending packages” OR “recommending package” OR “clustering recommendation” OR “clustering-based recommendation”

The first author ran the search string against the different databases. A list was made of all the articles and duplicates were removed. This resulted in 141 articles, as can be seen in Table 2. The first inclusion round was then performed using the inclusion criteria from Section 2.3. The first author reviewed all the articles based on title and abstract and recorded, for each paper, which inclusion criteria it did or did not match. Of the 141 articles, 30 were included. During the inclusion, a new criterion called ‘accessible’ was added: not every article could be accessed, so inaccessible articles were excluded from the research.

The 30 included articles were used as the starting point for snowballing. Snowballing is a method to find new articles based on the current articles by using their references (backward snowballing) or the articles that cite them (forward snowballing). In this research both methods were applied. Backward snowballing was applied for one round, so the references of the 30 included articles were added to a list. Forward snowballing was applied until no new articles were found. After backward and forward snowballing were applied and duplicates were removed, 412 new articles remained from backward snowballing and 400 from forward snowballing. For the second inclusion round, the procedure from the first inclusion round was applied to the articles obtained by snowballing. This resulted in 74 included articles, which brings the total to 104 included articles for the research.

Table 2

Inclusion of papers

Database	Search string	1st inclusion	Snowballing		2nd inclusion	Total
			Backward	Forward
ACM	24	2	0	0	0	2
IEEE Xplore	20	0	0	0	0	0
ScienceDirect	7	0	0	0	0	0
Scopus	90	28	0	0	0	28
General	0	0	412	400	74	74
Total	141	30	412	400	74	104

2.3 Inclusion and exclusion criteria

To select the studies suitable for this research, inclusion and exclusion criteria were defined. The inclusion criteria mentioned in the search strategy are defined below. They are split into two categories. The first category consists of inclusion criteria that should all be met by an article.

IC1: Publication date of an article should be between 1999 and 2018.
IC2: Language should be English.
IC3: An article should be accessible.
IC4: The article should be peer-reviewed.
IC5: The article should be a journal, conference or workshop paper.
IC6: The focus of the article is on package recommendation as defined in the introduction, if this is retrievable from the title or abstract.

For the second category, every article should meet at least one of the criteria.

IC7: A method is proposed to solve a package recommendation problem.
IC8: An evaluation method is used to evaluate a package recommender system.
IC9: Evaluation metrics are used or compared for the evaluation of package recommender systems.

The exclusion criteria are applied after the inclusion criteria. If an article meets one of the exclusion criteria, it is removed from the list.

EC1: The article is not about package recommendation as defined in the introduction.
EC2: An article is similar to another article and will be removed based on a quality assessment between the two articles.

The criteria were defined before the research took place to avoid bias. However, not all criteria could be foreseen, so IC3 was added during the research.

2.4 Data extraction and synthesis

The guidelines of Kitchenham and Charters [9] were used for the data extraction. The data extraction was performed by the first and third authors. Each researcher had their own set of articles to review, with some overlap to check whether both researchers extracted data in a similar way. Disagreements or doubts about articles were discussed by the two researchers and resolved by consensus. The extracted data were checked to ensure both researchers agreed on the conclusions drawn. The researchers used an Excel spreadsheet, prepared before the research, to extract data. The spreadsheet contained several columns, and some columns were added during the research because of new insights. For instance, phases were identified for the package recommendation methods column, which resulted in a new column for each phase. Another example is the constraints, which were added as a column. The following data were extracted: review date, title, authors, year of publication, reference, database, package recommendation methods used per phase, constraints, domain, evaluation methods used, evaluation metrics used, research method, data set and size of the data set.

During the data extraction process the exclusion criteria were applied. There were also a few cases where several articles were written about the same research: over the years the research was expanded, but the core remained the same. In these cases, the data from the articles was extracted, but was processed as if resulting from one article. Finally, some articles were excluded during the extraction process. A good example is P80 (Table 3), which was excluded during the data synthesis process, when the analysis made clear that no useful data could be extracted from that article. The exclusion and combining of articles during the data extraction and synthesis stage resulted in a total of 79 articles that have been used for this literature review. Table 3 lists all 79 included articles, plus article P80, which was excluded during the data extraction. The P-numbers are used throughout the remainder of this paper; the number in brackets after each P-number indicates the position of the article in the reference list.

Table 3
Article references for articles selected in the systematic literature review

Articles

P1 [10], P2 [11], P3 [12], P4 [13], P5 [14], P6 [15], P7 [16], P8 [17], P9 [18], P10 [19], P11 [20], P12 [21], P13 [22], P14 [23], P15 [24], P16 [25], P17 [26], P18 [27], P19 [28], P20 [29], P21 [30], P22 [31], P23 [32], P24 [33], P25 [34], P26 [35], P27 [36], P28 [37], P29 [38], P30 [39], P31 [40], P32 [41], P33 [42], P34 [43], P35 [44], P36 [45], P37 [46], P38 [47], P39 [48], P40 [49], P41 [50], P42 [51], P43 [52], P44 [53], P45 [54], P46 [55], P47 [56], P48 [57], P49 [58], P50 [59], P51 [60], P52 [61], P53 [62], P54 [63], P55 [64], P56 [65], P57 [66], P58 [67], P59 [68], P60 [69], P61 [70], P62 [71], P63 [72], P64 [73], P65 [74], P66 [75], P67 [76], P68 [77], P69 [7], P70 [78], P71 [79], P72 [80], P73 [81], P74 [82], P75 [83], P76 [84], P77 [85], P78 [86], P79 [87], P80 [88]

Table 4

Package recommendation domains in systematic review articles

Article nr	Domain	# articles
P22, P38–P77, P79	Travel	42
P10–P14	Education	5
P5–P7, P17	E-commerce	4
P20, P25, P27	Movies	3
P1, P24	Books	2
P2, P3	Clothing	2
P21, P23	Travel, movies	2
P26, P28	Travel, movies, books	2
P36, P78	Task assignment	2
P34, P37	(Sport) team selection	2
P4	Cosmetics	1
P8	E-commerce, supermarket	1
P9	E-commerce, furniture	1
P15	Food	1
P16	Gaming	1
P29	Movies, books, electronics, clothing	1
P31	Travel, restaurants	1
P32	Search engine results	1
P33	Software doc. architecture	1
P35	Supermarket	1
P18, P19, P30	Unspecified	3

3. Results

3.1 Domains

As shown in Table 4, PRS have been used in several domains, though substantially more in the travel domain than other domains. As discussed in [89], recommendation domains have different characteristics.

Table 5
Package types in systematic review articles

Article nr	Package type	# articles
P1–P10, P15–P37, P39, P44, P55, P56, P67, P75, P78	Complementary	40
P11, P13, P14, P38, P40–P43, P45–P54, P57–P66, P68–P74, P76, P77, P79	Sequence	38
P12	Unclear	1

In economics, a distinction is made between experience goods (which consumers learn about through experience) and search goods (for which direct experience is not needed) [90], and between sensory and non-sensory products [91]. Based on this, Tintarev and Masthoff [89] distinguish between items which are easy to evaluate objectively (such as light-bulbs and cameras) and those which require an experiential and subjective judgement (such as holidays and music). Almost all domains used for PRS are subjective ones.

In economics, cost is also seen as an important characteristic of a domain, with this not only including purchase price, but also the time and effort involved in the purchase/consumption and the psychological, physical, functional and social risk [92, 93, 91]. Tintarev and Masthoff [89] distinguish between high and low investment recommendation domains. Travel tends to be a high investment domain, so most papers are about package recommendation in high investment domains, though there is also work on lower investment domains (such as movies, books, clothes, and food). Additionally, what is a low investment domain when recommending a single item (such as movies and books) can become a high investment one when deciding on a set of movies to see or books to read, as consuming the package can involve a large time investment and higher costs from paying for multiple items. In a higher investment domain, a user may take longer to decide, so it is likely that users will take longer to decide on packages than on single items, which increases the users’ information needs and requirements for explanations in PRS.

Domain characteristics matter as they may impact the importance of different recommendation quality metrics. For example, a PRS can be incorrect in multiple ways: it can overestimate how much a user may like a particular package, or it can underestimate how much a user may like a package. The first can lead to recommendations the user may not end up liking. The second can lead to the user missing out on packages they may have liked. Research by [94] showed that in general users considered overestimation as less helpful than underestimation, but that overestimation was particularly problematic in high investment domains. They also found that users tended to be more forgiving about over- and underestimation in subjective domains. So, domain characteristics may impact the best method for combining individual items into packages.

Domain characteristics also influence the kind of constraints that are used in the package creation process. For example, in the travel domain, the items suggested are normally geographically close to each other. Bundling “the Great Wall” in China with the “London Tower Bridge” in England would not be a great idea, but combining the “London Tower Bridge” with “Buckingham Palace”, “Big Ben” and “the British Museum” can be considered a good bundle since they are located in the same city. In the travel domain, the location characteristic thus prunes many items as package candidates.

Table 6

Input types in systematic review articles

Article nr	Input type	# articles
P2, P24, P25, P27, P49, P50	Explicit	6
P7, P14, P16, P17, P29, P32, P35, P48, P65, P79	Implicit	10
P1, P3, P10, P12, P15, P18–P23, P30, P38–P40, P45, P47, P51–P55, P74, P76	Explicit + Features	24
P4–P6, P8, P9, P13, P26, P31, P34, P36, P41, P42, P44, P46, P57–P64, P66–P71, P73, P75	Implicit + Features	30
P28, P43, P56, P72, P77, P78	Explicit + Implicit + Features	6
P11, P37	Features only	2
P33	Unspecified	1

3.2 Package type

Table 5 shows the package types used in the systematic review articles. We distinguish between PRS that serve recommended packages as a sequence of items or as a complementary set of items. In a sequence of items, the items are presented in an ordered list. When the user consumes a recommended package, the user needs to follow the items in the order in which they are given in the list. For example, in the music domain a recommended playlist contains a sequence of songs. In the travel domain, a recommended sightseeing tour package contains a sequence of attractions. In contrast, in a complementary set of items, items are provided as an unordered list. For example, in the grocery domain, a PRS can bundle tea, coffee, and sugar in a package.

Combining the data in Tables 4 and 5, it is clear that in the systematic review articles sequences were only used in the travel and education domains. For some domains, sequences are not really an option. For example, it is hard to envisage a sequence of furniture items, cosmetics, or electronics. For some, it may be possible to do sequences, but it is rather far-fetched: for example, a sequence of clothes to wear over a week. For some, sequences are a possibility, such as movies to watch during a film festival, or books to read over a longer period that are sequentially thematically linked, but the commercial market may be small. In the education domain, there are two kinds of package recommendations: one that recommends a learning path (learning materials to consume in a certain sequence to reach a learning goal) and one that recommends a set of learning materials that are complementary to each other to reach a set of goals. In the travel domain, the vast majority of work is on sequences: typically points of interest are combined into an itinerary. The exceptions are cases where, for example, a package contains a flight and a hotel. We note that, unexpectedly, music playlists are not included here, even though there has been research on playlist recommendation (e.g., [95, 96, 97]). In fact, music is absent as a domain for package recommendation in Table 4. This seems to be an artifact of the search terms that were used in the systematic review, with people in music recommendation having used the domain-specific term “play list recommendation” (instead of more generic terms such as package, bundle or cluster).

3.3 Package consumer

The PRS can recommend a package to different types of package consumers. In the review, we found two types: individual users and groups of users. When the PRS recommends a package to an individual user, the PRS needs to consider that user’s preferences. This situation is similar to that of traditional recommender systems, which need to gather and analyze a user’s preferences from different inputs, for example, intrinsic and extrinsic input. When the PRS recommends a package to a group of users, it not only needs to consider each member’s preferences but also needs to aggregate the members’ preferences to provide a solution that is best for the group. In addition, there are situations where PRS are able to recommend to both types of users (individuals and groups). The articles of the systematic review were almost all about package recommendation to individuals, with the exception of P18, P23, P30, P43, and P48, which dealt with package recommendation to groups. The latter were all in the travel domain (or unspecified in terms of domain), which is remarkable as there seems to be an opportunity for package recommendation to groups in other domains as well, such as food.
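To illustrate what aggregating members’ preferences can look like, the following minimal Python sketch scores each candidate package for a group under two aggregation strategies. The strategy names (`average`, `least_misery`), the function names and the scores are assumptions made for this illustration; the reviewed articles are not claimed to use these particular strategies.

```python
from statistics import mean

def group_score(member_scores, strategy="average"):
    """Aggregate the individual members' scores for one package into a group score."""
    if strategy == "average":
        return mean(member_scores)
    if strategy == "least_misery":
        # The group is only as happy as its least happy member.
        return min(member_scores)
    raise ValueError(f"unknown strategy: {strategy}")

def best_package_for_group(package_scores, strategy="average"):
    """Return the package whose aggregated group score is highest."""
    return max(package_scores, key=lambda p: group_score(package_scores[p], strategy))

# Hypothetical predicted scores (1 to 5) of three group members per package.
package_scores = {
    "museum tour": [5, 5, 1],
    "river walk": [3, 4, 3],
}
by_average = best_package_for_group(package_scores, "average")
by_least_misery = best_package_for_group(package_scores, "least_misery")
```

In this toy example the two strategies disagree: averaging favours the package two members love, whereas least misery favours the package nobody dislikes, which shows why the choice of aggregation matters for group PRS.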

3.4 Recommendation input

As shown in Table 6, PRS differ in the input they use to produce recommendations. Just like traditional individual item recommender systems, PRS can use implicit and explicit input from users. Implicit input is provided unconsciously (for example, item clicks or time spent looking at an item), whilst explicit input is provided consciously (for example, in the form of ratings on a 1 to 5 scale). Contrary to traditional recommender systems which use input regarding individual items, PRS can also use input regarding item packages, such as package ratings or time spent looking at a package.

As seen in Table 6, both implicit and explicit input are used in PRS, with implicit input being used more often. Combining the data from Tables 4 and 6, this seems to be an effect of domain, with implicit feedback being used a lot in the travel domain (in 30 out of 47, i.e. 64% of cases). For example, a PRS in the travel domain can collect implicit input in the form of the previous journey of a user (containing a travel package or single destination) or the Global Positioning System (GPS) coordinates of their current position. In some cases, such input helps the PRS to discard irrelevant items. For example, in the travel domain, a PRS using GPS coordinates can discard attractions which are too far away from the user’s current position. PRS focused solely on the movies domain all used explicit input (typically ratings from the MovieLens dataset). There were only 6 PRS articles that combined implicit and explicit input, so more research could be done on this combination. There was no clear preference for input type in PRS for education, with both implicit and explicit input being used.
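The GPS-based pruning described above can be sketched as a simple distance filter. This is a minimal illustration, not the method of any specific reviewed article: the haversine formula is one common choice for great-circle distance, and the function names, the 10 km threshold and the approximate coordinates are assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_attractions(user_pos, attractions, max_km):
    """Keep only the attractions within max_km of the user's current position."""
    return [
        name
        for name, (lat, lon) in attractions.items()
        if haversine_km(user_pos[0], user_pos[1], lat, lon) <= max_km
    ]

# Approximate coordinates, for demonstration only.
attractions = {
    "Tower Bridge": (51.5055, -0.0754),
    "British Museum": (51.5194, -0.1270),
    "Great Wall (Badaling)": (40.3540, 116.0200),
}
user_pos = (51.5074, -0.1278)  # a user in central London
candidates = nearby_attractions(user_pos, attractions, max_km=10.0)
```

Only the two London attractions survive the filter, so later phases need to consider far fewer candidate items.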

Another input type which is often used to recommend packages is the items’ features. For example, in the travel domain, an attraction can have many features, such as a description, attraction type (e.g. museum), opening hours, season, cost, and location. Whilst this input type is also occasionally used in traditional recommender systems as a basis for content-based filtering, in PRS it plays an additional role, in that it forms the basis for constraints (e.g. rules on which colours can be combined in clothing outfits or which locations are good to combine in a tourist route). As seen in Table 6, features are used in most PRS articles surveyed (79% of articles that specified input type). This is even more pronounced in the travel domain (37 out of 42, i.e. 88%, of articles solely on travel). Combining the data from Tables 5 and 6, features are used more in PRS recommending packages as sequences than as complementary sets (84% compared to 73%).

Explicit and implicit data can concern individual items or item packages. Table 7 shows which articles used input data only for individual items, and which also used data for packages. For implicit data, we assumed that travel (point of interest) check-in data provides package data (as one can see sequences in check-ins) and that items bought together can also be regarded as a package. Of the articles that used only explicit data, a minority (27%) used package data, compared to most (83%) of the articles that used only implicit data.

Table 7
Only individual vs also package input

Input type	Input data	Articles	#
Explicit	Individual	P1, P18–P24, P27, P39, P40, P45, P47, P49, P50, P52, P53, P55, P76	19
	Package	P2, P3, P10, P12, P15, P25, P54	7
	Unspecified	P30, P38, P51, P74	4
Implicit	Individual	P6, P26, P31, P32, P34, P36, P66	7
	Package	P4, P5, P7–P9, P13, P14, P16, P17, P29, P35, P41, P42, P44, P46, P48, P57–P65, P67–P71, P73, P75, P79	33
Explicit + Implicit	Individual	P28, P56, P77, P78	4
	Package	P43, P72	2
Features only		P11, P37	2
Unspecified		P33	1

3.5 PRS phases and techniques

Researchers have solved the PRS problem using different techniques. In general, they used phases to recommend packages. We have identified three phases and classified articles by the techniques they use in each phase:

• The model learning phase, where the PRS learns several aspects required to produce recommendations, such as a user’s preferences.
• The package creation phase, where the PRS mixes and matches items into packages and collects these packages into a package candidates list.
• The package selection phase, where the packages to be recommended are selected from the package candidates list. In this phase, the researcher evaluates the package candidates using techniques such as top-N.

Even though we classified the PRS articles based on the three phases, some articles only used one or two. For example, some researchers used pre-defined packages or random selection for the second phase, and focused on understanding the user’s preferences and on selecting the best packages to recommend to a user. Tables 8–10 show which techniques were used for each phase in the survey articles.

3.5.1 Model learning phase

In the model learning phase, the PRS uses the input as described in Section 3.4 to obtain knowledge about the user’s preferences and also item characteristics. As shown in Table 8, several techniques are used, in particular clustering, collaborative filtering (CF), user preference modelling, item relationship modelling, and topic modelling.

Table 8
PRS Phase 1: Model learning

Technique		Articles	#
Clustering		P4, P13, P30, P35, P43, P71, P79	7
Collaborative filtering	Matrix factorization (incl. BPR)	P2, P3, P8, P14, P16, P25, P27, P46, P47, P53, P61	11
	Item-based CF	P1, P8, P19, P20, P23, P39	6
	Feature-centric CF	P47	1
	Memory-based CF	P21	1
	Hybrids (e.g. using Gaussians)	P54	1
	Other	P28, P58, P59	3
	Unspecified	P10, P49, P57, P67, P70	5
Other user preference modeling	Check-ins	P42, P43, P68, P69	4
	Correlated cross-occurrence	P49	1
	Multi-attribute utility theory	P49	1
	Content-based filtering	P34, P60	2
	Clustering optimisation diversity	P18	1
Item Relationship Modeling	Markov Chain, Probability model	P7, P22, P29	3
	Pattern mining (e.g. Apriori)	P43, P48	2
	Ontology	P38, P77	2
Topic Modeling	LDA	P9, P53, P65, P75	4
	Gibbs sampling	P65, P67	2
	Bernoulli	P65	1
	Bayesian	P44, P67	2
	Restricted Boltzmann machine	P15	1
	TF-IDF	P60	1
	Word embeddings (Course2Vec, CBOW)	P14, P61	2
	Unspecified	P64, P73	2
No model learning		P6, P12, P17, P32, P37, P40, P45, P50–P52, P63, P76	12
Unspecified		P5, P11, P24, P26, P31, P33, P36, P41, P55, P56, P62, P66, P74, P78	14

The algorithm used is also influenced by the PRS input. PRS which only use explicit input, such as ratings, tend to use CF methods such as matrix factorization (MF), item-based CF, and memory-based CF. For example, Wibowo et al. [11, 12] used user-item-rating and user-package-rating matrices of clothes as input and used MF to obtain users’ and items’ latent factors. Combined with an aggregation function (such as minimum, maximum, or harmonic mean), they then used these latent factors to approximate the package rating.
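As a rough illustration of combining latent factors with an aggregation function, the sketch below predicts each item’s rating as the dot product of user and item factors and then aggregates the item ratings into a package rating. This is one possible reading under stated assumptions, not the formulation of [11, 12]: the factor values and the names `predict_item_rating` and `predict_package_rating` are hypothetical, and in practice the factors would be learned by MF.

```python
from statistics import harmonic_mean

def predict_item_rating(user_factors, item_factors):
    """Predicted rating as the dot product of user and item latent factors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# Aggregation functions named in the text; harmonic_mean expects non-negative ratings.
AGGREGATORS = {"min": min, "max": max, "harmonic_mean": harmonic_mean}

def predict_package_rating(user_factors, package_item_factors, aggregator="min"):
    """Aggregate the predicted ratings of a package's items into one package rating."""
    ratings = [predict_item_rating(user_factors, f) for f in package_item_factors]
    return AGGREGATORS[aggregator](ratings)

# Hypothetical two-dimensional latent factors.
user = [1.0, 2.0]
shirt = [1.0, 1.0]     # predicted item rating 3.0
trousers = [2.0, 2.0]  # predicted item rating 6.0
outfit = [shirt, trousers]

rating_min = predict_package_rating(user, outfit, "min")
rating_max = predict_package_rating(user, outfit, "max")
rating_hm = predict_package_rating(user, outfit, "harmonic_mean")
```

The minimum treats a package as only as good as its weakest item, the maximum as good as its best item, and the harmonic mean sits in between while still penalising a low-rated item.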

PRS that used unstructured data on items or packages, such as text, often used topic modelling techniques such as Latent Dirichlet Allocation (LDA). LDA parses the text descriptions and automatically extracts specific topics which relate to items or packages. For example, Xiong et al. [83] used travel website information as input and used LDA-based topic analysis to automatically extract the topics. They then matched the extracted topics with the users’ interests. Zhang et al. [62] used LDA to classify the POI categories from content descriptions of each POI. They then determined whether a proposed route (containing a sequence of POIs) is feasible in the sense of containing at least a certain number of POI categories.

Some PRS used item relationship modelling (such as Markov chain, probability model, Apriori and ontology) to model the relation between items. For example, Yu et al. [38] used a Markov chain-based approach to model the relations of particular products with regard to the users’ sequential behaviour. Mikhailov et al. [85] used an ontology to model, among other things, the similarity between attractions. With the available survey data, no clear relationship between the input type and item relationship modelling could be found.
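A first-order Markov chain over items can be estimated by counting which item follows which in users’ behaviour sequences. The sketch below is a generic illustration of that idea, not the specific model of [38]; the function name and the session data are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next item | current item) from observed item sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each consecutive (current, next) pair in the sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        item: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for item, followers in counts.items()
    }

# Hypothetical sequences of items a user interacted with, in order.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
]
probs = transition_probabilities(sessions)
```

Here `probs["phone"]` assigns a higher probability to "case" than to "charger", which a PRS could use as evidence that a phone and a case belong in the same package.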

Some PRS used domain-specific methods. For example, some travel PRS used check-in data to infer user preferences. According to this method, the more often a user visits a certain POI, the more that user prefers it. The categories of the most visited POIs are then determined in order to calculate the user’s preferences for the categories in the system.
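The check-in idea can be sketched as counting visits per POI and rolling those counts up to categories. Expressing the preference as each category’s share of all check-ins is an assumption made for this illustration, as are the function name and the data; the reviewed articles may weight or normalise differently.

```python
from collections import Counter

def category_preferences(checkins, poi_category):
    """Turn a user's check-in history into a preference share per POI category."""
    visits = Counter(checkins)  # number of visits per POI
    per_category = Counter()
    for poi, n in visits.items():
        per_category[poi_category[poi]] += n
    total = sum(per_category.values())
    return {category: n / total for category, n in per_category.items()}

# Hypothetical check-in history and POI-to-category mapping.
poi_category = {"museum_a": "museum", "museum_c": "museum", "park_b": "park"}
checkins = ["museum_a", "museum_a", "park_b", "museum_c"]
prefs = category_preferences(checkins, poi_category)
```

For this user three of four check-ins are at museums, so the inferred preference for the museum category is 0.75 against 0.25 for parks.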

In some cases clustering was used to learn the user model. Clustering is used to find groups of similar items or groups of similar users. For instance, in P4, items are clustered to determine the preferences of a user for a certain group of items. Clustering users is used to find users with similar preferences or qualities. For instance, in P13, students with similar results were clustered to find which courses fit best with a certain group of students. Clustering is not only applied for individual recommendations, but also for group recommendations. For example, in P30, user clustering is used to determine the preferences of a group of users for a group PRS. Except for P30, all articles that used clustering made use of implicit input data.

Several PRS did not use model learning, for example, because their package construction did not require user preferences, but only used known item features and constraints. For example, in P37 the goal is to recommend a team consisting of complementary team members. This is based on members’ skills and social fit in a team, so model learning of preferences is not needed. Model learning is also not needed when users explicitly enter their preferences (e.g. as a search query) and only that information is used to make recommendations, so that no history or data from other users is used.

3.5.2 Package creation phase

As mentioned above, the package creation phase is used to generate a package candidates’ list. In this process, items are combined with other items into a package.

As can be seen in Table 9, many researchers have regarded this as a knapsack problem, a combinatorial optimization problem in which a collection of items is selected that maximizes the value or minimizes the cost, whilst remaining within certain constraints. For example, in the travel domain, the knapsack problem can be defined as how to include as many POIs as possible in an itinerary, whilst remaining within the user’s budget (money and/or time). Several knapsack algorithms were implemented, using for example dynamic programming, search algorithms (e.g., greedy search), and evolutionary algorithms (such as Ant Colony Optimization [71]).¹

¹ Ant Colony Optimization iteratively randomizes the items involved in a package and evaluates its estimated value (such as travel distance), whilst improving the solutions randomly over several iterations.
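To make the knapsack view concrete, the following minimal sketch solves a 0/1 knapsack by dynamic programming: it selects the set of POIs with the highest total value whose total visiting time fits within a time budget. It is a textbook formulation under simplifying assumptions (integer costs, no travel time between POIs, hypothetical names and values), not the algorithm of any specific reviewed article.

```python
def best_itinerary(pois, budget):
    """0/1 knapsack by dynamic programming.

    pois: list of (name, cost, value) tuples with non-negative integer cost.
    Returns (total_value, chosen_names) for the best selection within budget.
    """
    # best[b] holds the best (value, names) found so far with total cost <= b.
    best = [(0, [])] * (budget + 1)
    for name, cost, value in pois:
        # Iterate budgets downwards so that each POI is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate_value = best[b - cost][0] + value
            if candidate_value > best[b][0]:
                best[b] = (candidate_value, best[b - cost][1] + [name])
    return best[budget]

# Hypothetical POIs: (name, visiting time in minutes, estimated value to the user).
pois = [
    ("museum", 120, 8),
    ("tower", 90, 6),
    ("park", 60, 4),
    ("gallery", 150, 9),
]
total_value, chosen = best_itinerary(pois, budget=240)
```

With a budget of 240 minutes, the best selection is the tower plus the gallery (value 15), which beats the museum plus the tower (value 14) even though the museum has the second-highest individual value. A greedy search would be faster on large item sets but is not guaranteed to find this optimum.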

Table 9

PRS Phase 2: Package creation

Technique	Articles	#
Knapsack	Greedy (e.g. Ford-Fulkerson)	P9, P10, P11, P19, P21, P28, P31, P32, P33, P39, P40, P48, P52, P54, P63, P70, P76	17
	Random walk	P5, P12	2
	Heuristic	P31, P37, P45, P47, P79	5
	Branch and Bound	P7, P58	2
	Brute force (e.g. Breadth first, Depth first)	P33, P43, P47, P53, P54, P66	6
	Dynamic programming (e.g. Dijkstra shortest path, Floyd-Warshall)	P41, P42, P50–P52, P58, P55	7
	Integer linear programming	P10, P20, P24, P61	4
	Ant colony algorithm	P62	1
	Recurrent Neural Network	P68, P69	2
	Zero-suppressed Decision Diagram (ZDD)	P17	1
Clustering	K-means	P27, P73	2
	Nearest neighbours (incl. case-based reasoning)	P1, P19, P31, P32, P38, P39, P73	7
	Hierarchical	P71	1
	Fuzzy	P26, P36, P73, P78	4
Expanding from pivot items (e.g. Bundles One-By-One)		P4, P31, P32, P39	4
Similarity	Jaccard	P32, P56	2
	Association rules	P35	1
	Apriori	P59	1
Predefined		P25, P44, P75	3
Random selection		P2, P3	2
All possible		P6, P8, P13, P15, P18, P22, P23, P65	8
Constraints		P1, P3, P5, P6, P9–P11, P13, P17–P23, P26, P28, P30–P43, P45, P47, P48, P50–P55, P58–P62, P67–P79	58
Unspecified		P14, P16, P29, P46, P49, P57, P64	7

Some researchers have regarded package creation as a clustering problem, where a package is formed from items that share common values within a group of data. For example, in the travel domain, POIs can be clustered based on their attributes (such as geographical location and POI type). In the surveyed articles, several clustering methods have been applied, of which two are used the most: nearest neighbours and fuzzy clustering. Nearest neighbours (k-NN) is a clustering method which finds the k items most similar to a target item. Sometimes k-NN is used within other methods. For instance, the papers which make use of the Bundles One-By-One (BOBO) algorithm use k-NN to create packages around BOBO’s pivots (target items). The other commonly used method, fuzzy clustering, is very similar to k-means. Just as in k-means, k clusters of data points are created. However, whereas in k-means a data point can only belong to one cluster, in fuzzy clustering a data point can belong to multiple clusters. In the surveyed articles, clustering is used equally often with implicit and explicit data. In 75% of these articles the input data is individual item data, and likewise in 75% of them clustering is used to produce complementary packages (rather than sequences). So, clustering has mostly been used to produce complementary packages based on individual item data.
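The idea of building a package around a pivot item with k-NN can be sketched as follows. This is only an illustration of the nearest-neighbour step, not the BOBO algorithm itself; the function names and the use of two-dimensional coordinates with Euclidean distance as item features are assumptions.

```python
import math
from typing import Dict, List, Tuple


def nearest_neighbours(pivot: str, features: Dict[str, Tuple[float, float]], k: int) -> List[str]:
    """Return the k items closest to the pivot (Euclidean distance on features)."""
    others = [name for name in features if name != pivot]
    others.sort(key=lambda name: math.dist(features[pivot], features[name]))
    return others[:k]


def package_around(pivot: str, features: Dict[str, Tuple[float, float]], k: int) -> List[str]:
    """Form a candidate package from the pivot and its k nearest neighbours."""
    return [pivot] + nearest_neighbours(pivot, features, k)
```

In a travel setting the features could be POI coordinates, so that a package groups POIs which are close to the pivot POI.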

When a user likes two items individually, this does not necessarily mean that the user would like the combination of the two. For example, somebody may like a red pair of trousers and an orange shirt, but may not want to wear them together. Therefore, when combining items into packages, most papers used constraints. For example, in the clothing domain, constraints have specified which colours, patterns, and formality to combine. Constraints can be manually constructed or learned. In the travel domain, a cost function such as travel time and distance is often used as a basis of constraints (e.g., not to select two items that are too far from each other).

Some papers used predefined packages, or assigned items randomly to packages to create a package list. Others created all possible packages, based on a package model, which specifies the frequency of item types in a package or other constraints. For example, in a clothes PRS, if a package can contain a shirt and a pair of trousers, they would generate all possible combinations of a shirt and pair of trousers. Similarly, in an educational PRS, all possible course sequences can be generated, taking course pre-requisites, maximum number of courses to take in each term, and which course runs in which semester into account (P13).
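The combination of a package model with constraints can be illustrated by the sketch below, which generates all shirt and trousers combinations and then filters them with a manually constructed constraint. The item representation (dictionaries with a `colour` key), the function names, and the specific colour rule are hypothetical and chosen only to mirror the red and orange example above.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Item = Dict[str, str]


def all_packages(shirts: List[Item], trousers: List[Item],
                 allowed: Callable[[Item, Item], bool]) -> List[Tuple[Item, Item]]:
    """Generate every (shirt, trousers) package that satisfies the constraint."""
    return [(s, t) for s, t in product(shirts, trousers) if allowed(s, t)]


def no_red_with_orange(shirt: Item, trousers: Item) -> bool:
    """Hypothetical manual constraint: never combine red with orange."""
    return {shirt["colour"], trousers["colour"]} != {"red", "orange"}
```

Because the number of candidate packages grows with the product of the item type sizes, this exhaustive approach is only feasible for small catalogues or strongly constrained package models.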

Table 10

PRS Phase 3: Package selection

Technique	Articles	#
Top-N	P1, P3, P5, P6, P8, P9, P11, P12, P14, P15, P18–P23, P26, P28, P30–P32, P34–P36, P38, P39, P42–P44, P48, P49, P54, P56, P58–P60, P64–P69, P71, P72, P75	45
Top-1	P2, P4, P7, P10, P13, P16, P17, P25, P33, P40, P41, P45, P47, P50, P51–P53, P55, P61–P63, P73, P76, P77, P79	25
Multiple packages unranked	P24, P37, P70, P74, P78	5
Unspecified	P27, P29, P46, P57	4

3.5.3 Package selection phase

Package selection is the last phase of package recommendation. In this phase, the PRS selects a number of packages from the list of candidate packages obtained in the previous phase. As can be seen in Table 10, the most common approach for recommending packages is Top-N, whereby the N best packages (as estimated by the PRS) are recommended to the user in a ranked list. Top-1, a special case of Top-N in which only the best package is recommended, is also very popular, and has been used more often than in individual item recommendation. In an individual item recommender, a popular alternative to Top-N is to show all items (not ordered) with a rating system (e.g. stars) to indicate the recommender’s estimated user preferences for each item. This is not really used in PRS, simply because the number of packages tends to be far too large. Some PRS show multiple packages that have been deemed suitable without providing a ranking. It is likely that the number of packages shown to the user also depends on the complexity of the packages. For example, when a package contains many items, it is likely that fewer packages are shown. We have not yet found a domain effect: for example, the percentage of travel PRS that used Top-1 is 34%, very similar to the 32% of all PRS that used Top-1.²

²
Whilst many travel PRS produce more complicated packages containing multiple points of interest, some travel PRS just combine flights and hotels, so the domain categories we used are not necessarily a reflection of package complexity.

More investigation is needed into the effect of package complexity and domain on the way recommendations are and should be presented.
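The Top-N selection described in this section, with Top-1 as the special case N = 1, can be sketched in a few lines. The function name and the representation of candidates as (package, estimated score) pairs are assumptions for the example.

```python
import heapq
from typing import List, Tuple


def top_n(scored_packages: List[Tuple[str, float]], n: int) -> List[Tuple[str, float]]:
    """Return the n candidate packages with the highest estimated score, best first."""
    return heapq.nlargest(n, scored_packages, key=lambda pair: pair[1])
```

Calling `top_n(candidates, 1)` yields the Top-1 recommendation.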

3.6 Evaluation

Table 11 shows how PRS have been evaluated.³

³
Some articles used more than one form of evaluation, so the number of evaluation forms can be higher than the total number of articles.

3.6.1 Computational evaluations

The vast majority of articles surveyed did not evaluate the PRS through a user study, but instead used an off-line computational evaluation method. Most (64%) papers that used a computational evaluation method measured accuracy. This is in line with many studies of traditional recommender systems, where the emphasis has been on prediction accuracy and top-N recommendation accuracy.

Table 11
Evaluation methods and metrics in systematic review articles

Evaluation	Metrics		Articles	#
Computation	Accuracy: Rating	RMSE, MSE, MAE, WAPE	P2, P8, P15, P25, P27, P47, P53, P57	8
	Accuracy: Ranking	Kendall’s Tau, Fagin’s intersection metric, Degree of agreement	P22, P44, P67, P73	4
	Accuracy: Relevance	Precision@N	P1, P4, P7, P8, P12, P20, P27, P28, P29, P30, P32, P39, P46, P59, P61, P65, P67, P68, P69, P71	20
		Recall@N	P3, P4, P8, P14, P18, P27, P29, P30, P46, P61, P65, P67, P71	13
		AUC	P16, P29	2
		$F_{1}$	P1, P27, P39, P61, P68, P69, P71	7
		Hit-rate, Jaccard similarity, Longest common subsequence	P9, P42, P56, P63, P65, P70, P72	7
	Accuracy: Relevance + Ranking	nDCG	P4, P21, P22, P28, P32, P44, P59, P65, P68, P69, P73	11
		AP, MAP, WAP	P10, P12, P18, P22, P30, P60, P64	7
	Scalability	Execution time, Processing time	P5, P6, P10, P11, P17, P20, P23, P24, P26, P28, P31, P33, P34, P40, P41, P43, P45, P47, P48, P53, P54, P58, P59, P62, P63, P76, P78, P79	28
		Memory storage	P59	1
	Utility	Package value/cost, Order size	P7, P8, P13, P20, P21, P23, P24, P31, P43, P45, P47, P53, P62, P68, P69, P71, P72, P79	18
	Coverage		P5, P15, P16, P24, P31, P37, P41, P56	8
	Diversity	Inter-, Intra package diversity	P1, P28, P31, P32, P39, P68, P69, P70	8
	Cohesion		P23, P32, P37	3
	Perplexity		P67, P75	2
	Novelty		P44	1
User study	Satisfaction	Perceived usefulness, usability	P3, P23, P40, P43–P45, P48, P49, P50, P51, P63, P66, P67, P76	14
		Retention	P78	1
	Performance	Throughput, Execution time	P36, P78	2
	Accuracy: Relevance + Ranking	MRR	P49	1
		Acceptance rate, Click-to-open, Conversion-to-open	P7, P9, P36, P78	4
Expert study			P33, P52, P55, P76	4
None			P19, P35, P38, P74, P77	5

Accuracy. Prediction accuracy normally measures the extent to which the predicted item ratings correspond with the actual item ratings⁴ [98]. Standard measures include Mean Absolute Error (MAE), Mean Square Error (MSE), and Root Mean Square Error (RMSE), whilst one paper (P8) also used the less common Weighted Average Percentage Error (WAPE). Less frequently, prediction accuracy is based on the accuracy of the ranking, mostly measured by the correlation between the predicted ranked list and the actual ranked list, for example using Kendall’s rank correlation coefficient (Kendall’s tau) or Fagin’s intersection metric (P22).

⁴
Where actual item ratings are not necessarily given explicitly, but can be inferred through implicit input.
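For concreteness, the rating-based and ranking-based accuracy measures mentioned above can be sketched as follows. The function names are ours, and the Kendall’s tau shown is the simple tau-a variant, which assumes there are no ties in either list.

```python
import math
from itertools import combinations
from typing import Sequence


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean Absolute Error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root Mean Square Error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))


def kendall_tau(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs; assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) / 2
    return (concordant - discordant) / n_pairs
```

MAE and RMSE compare rating values directly, whereas Kendall’s tau only considers whether each pair of packages is ordered in the same way in both lists.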

Prediction accuracy treats errors in predictions for good and bad items equally, whilst recommender systems tend to show a user only a limited number of items, namely those with good rating predictions. Therefore, recommender systems are also often evaluated on the relevance of the recommendations in a ranking situation using a top-N recommendation task [99].⁵

⁵

Metrics such as Fagin’s intersection metric are in between these two categories: they consider rankings and can be applied to top N.

Standard measures for this are precision@N and recall@N, measures that combine precision and recall such as Area Under Curve (AUC) and $F_{1}$, and measures that reflect whether the user selects/consumes the package recommended. Similarly to normal recommender systems, the latter can be measured through the hit-rate, for example measuring the proportion of recommended packages that are clicked on. However, whilst in a normal recommender system, a recommendation is either used or not, in a PRS the situation is more complex. Users could for example use part of a recommendation. Therefore, several surveyed papers used the Jaccard similarity metric, which measures the similarity between two sets (the set of items in the recommended package compared to the set of items actually selected). When the package recommended is a sequence, one can also look at the longest common subsequence, as the Jaccard similarity metric does not take order into account.
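The relevance measures discussed above can be sketched as follows. The function names and the example data are hypothetical; precision@N is divided by N and recall@N by the number of relevant packages, which is one common convention.

```python
from typing import List, Sequence, Set


def precision_at_n(recommended: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n recommended packages that are relevant."""
    return sum(1 for p in recommended[:n] if p in relevant) / n


def recall_at_n(recommended: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the relevant packages that appear in the top n."""
    return sum(1 for p in recommended[:n] if p in relevant) / len(relevant)


def jaccard(recommended_items: Sequence[str], consumed_items: Sequence[str]) -> float:
    """Set overlap between the recommended package and the items actually consumed."""
    a, b = set(recommended_items), set(consumed_items)
    return len(a & b) / len(a | b) if a | b else 1.0


def lcs_length(recommended_seq: Sequence[str], consumed_seq: Sequence[str]) -> int:
    """Length of the longest common subsequence (order-aware comparison)."""
    prev: List[int] = [0] * (len(consumed_seq) + 1)
    for x in recommended_seq:
        curr = [0]
        for j, y in enumerate(consumed_seq, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For a recommended itinerary `["museum", "park", "tower", "market"]` of which the user visits `["park", "museum", "market"]`, the Jaccard similarity is 0.75, whilst the longest common subsequence has length 2 because the user visited the park and the museum in the opposite order.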

Additionally, some accuracy measures of relevance take the ranking positions into account such as normalized Discounted Cumulative Gain (nDCG), Average Precision (AP), Mean Average Precision (MAP), and Weighted Average Precision (WAP). Table 11 shows that accuracy measures based on relevance are most popular in PRS evaluations.
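A minimal sketch of two of these rank-aware measures is given below. Several variants of both exist; here nDCG uses a logarithmic discount with the ideal ordering derived from the same list, and Average Precision is normalised by the number of relevant packages retrieved. Both choices, and the function names, are assumptions.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted Cumulative Gain with a log2 discount on the rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_n(relevances: Sequence[float], n: int) -> float:
    """nDCG@n; relevances are graded relevances of the recommended packages in ranked order."""
    ideal = dcg(sorted(relevances, reverse=True)[:n])
    return dcg(relevances[:n]) / ideal if ideal > 0 else 0.0


def average_precision(hits: Sequence[bool]) -> float:
    """AP over a ranked list; hits[i] is True when the package at rank i + 1 is relevant."""
    score, found = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            score += found / rank
    return score / found if found else 0.0
```

MAP is then simply the mean of `average_precision` over all users.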

Many papers used a combination of different accuracy metrics. For example, papers that measured the accuracy of predicted item rankings, often also measured the relevance of the recommendations using nDCG. Similarly, papers that measured the relevance of recommendations using nDCG also reported Precision@N (with the exception of P21).

Measuring accuracy requires a gold standard: actual ratings⁶ (and/or rankings) for recommended packages. As was shown in Table 7, many PRS did not use package ratings. In such cases, there is no real gold standard. Indeed, accuracy was measured more often in papers that used package ratings: 71% of papers that used package ratings compared to 40% of papers that used only individual ratings. Remarkably, this still means that 12 papers measured accuracy despite only having individual item data. In such cases, typically a package was deemed as good as the average rating of the package items. Another issue with the computational measurement of accuracy is that of the 42 articles that used package ratings, 33 articles only used implicit package ratings. Implicit data often struggles to make a distinction between items the user has seen and disliked, and items the user has never seen (the latter will be very frequent in PRS).

⁶
Explicitly or implicitly acquired.

More recently, researchers have been arguing against using only accuracy measures for recommender systems, and have advocated also using measures such as coverage, confidence, diversity, novelty, serendipity, utility, and scalability [100].

Scalability is the extent to which the PRS can deal with larger data sets (in terms of processing power required and speed). In the papers reviewed, after accuracy, scalability was evaluated the most, with scalability being measured in 42% of the papers which had computational evaluations. A high proportion (75%) of papers that evaluated scalability did not evaluate accuracy.

Utility is the extent to which a recommendation is useful, so its value. Two kinds of value can be distinguished: value to the package consumer and value to the package provider. In the surveyed articles, the ways to calculate package value were often domain specific. For example, a book PRS used revenue gain to measure value for the package provider. A travel PRS used travel time as one way to measure value for the package consumer (with a higher travel time meaning lower value). An educational PRS used grade point average and graduation time, which can measure value for both package provider and consumer. Utility was measured in 27% of the papers which had computational evaluations. In about half these papers, it replaced the accuracy measurement.

Coverage is the percentage of users for which the system can provide recommendations and/or the portion of items that can be recommended. Standard measures include the Gini index and Shannon entropy. Sometimes the performance of the system can be measured also specifically for new (‘cold’) items or users (who have fewer than a certain number of ratings). Coverage was measured in only 12% of the papers which had computational evaluations.
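As an illustration, item coverage and the Shannon entropy of the recommendation distribution could be computed as sketched below. The function names and the representation of recommended packages as lists of item identifiers are assumptions; the Gini index is omitted here.

```python
import math
from collections import Counter
from typing import Iterable, List


def item_coverage(recommended_packages: Iterable[List[str]], catalogue: List[str]) -> float:
    """Proportion of catalogue items that appear in at least one recommended package."""
    recommended = {item for package in recommended_packages for item in package}
    return len(recommended & set(catalogue)) / len(catalogue)


def shannon_entropy(recommended_packages: Iterable[List[str]]) -> float:
    """Entropy (in bits) of how often each item is recommended; higher means more even."""
    counts = Counter(item for package in recommended_packages for item in package)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```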

Diversity is traditionally the extent to which recommendations are dissimilar from each other. In PRS, two types of diversity are used, intra-package diversity and inter-package diversity. Intra-package diversity is the diversity of items within a package. Inter-package diversity is the diversity between recommended packages. Out of 8 articles, only P31 used inter-package diversity. Intra-package diversity is calculated as 1 minus the average similarity between any two items in the package, which is typically calculated based on the item features. Diversity was measured in only 12% of the papers which had computational evaluations; all but one of these used features.
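The intra-package diversity calculation described above (1 minus the average pairwise similarity) can be sketched as follows. Representing each item as a set of features and using Jaccard similarity between feature sets are assumptions; the surveyed papers may use other similarity functions.

```python
from itertools import combinations
from typing import List, Set


def feature_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two items' feature sets (assumed similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def intra_package_diversity(package: List[Set[str]]) -> float:
    """1 minus the average similarity over all pairs of items in the package."""
    pairs = list(combinations(package, 2))
    if not pairs:
        return 0.0  # a package with fewer than two items has no pairwise diversity
    average = sum(feature_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - average
```

Inter-package diversity can be obtained analogously by applying a similarity function to pairs of recommended packages rather than to pairs of items.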

Cohesion is the extent to which items within a package belong together in terms of similarity. This is typically the opposite of intra-package diversity. Only 3 articles explicitly considered cohesion. The trade-off between cohesion and diversity seems domain dependent. In some domains (e.g. team recommendation) cohesion may well be more important, whilst in other domains diversity may be more important.

Confidence is the system’s trust in its own predictions. We did not find any papers explicitly mentioning confidence, but did find two papers using the related concept Perplexity, which is a measure of uncertainty.

Novelty is the extent to which users were unaware of recommended items.⁷ This was only evaluated in one paper (P44). To be able to measure novelty, one needs to know which packages a user will consume in future, so typically a dataset which includes time is needed, so that recommendations are based on data up to a certain point in time.⁸ Novelty can then be measured by looking at whether a recommended item was already known to the user (so consumed earlier) or not. P44 used an implicit dataset for this.

⁷
Another measurement is serendipity: the extent to which successful recommendations are surprising to the user (e.g. a recommendation of a new book by their favourite author may be novel, but not surprising). Serendipity was not measured for PRS at all, neither in the computational evaluations nor user studies.

⁸
Splitting the data in the earlier time period into a training and test set as usual, whilst adding the later time period to the test set.

In principle, it is possible to evaluate aspects such as accuracy, coverage, diversity and scalability also during a user study, in which study participants interact with a PRS. However, most papers surveyed used these metrics in an off-line setting, mostly using an existing data set as a basis for both the system and the evaluation (mostly using n-fold cross-validation, whereby a part of the data is used to inform the system and a part to evaluate it, and this is repeated n times).⁹,¹⁰ This is why most papers using these metrics are presented under the computation evaluation method in the table.

⁹
In some cases (e.g. P18), a user study is done to create a new dataset, but this user study is not used to directly measure the performance of the system, but rather to construct a dataset that is used again in an off-line setting.

¹⁰
In some cases (e.g. P25) one dataset is used to produce the PRS and another to evaluate the PRS.

3.6.2 User studies and expert evaluations

Only 23% of articles contained a user study: 13 articles combined a computational evaluation with a user study, and 5 contained only a user study. As shown in Table 11, user studies mainly focused on user satisfaction with the package recommendations (including perceived usefulness), though there was also some work on accuracy and performance. Only 3 papers contained an expert evaluation; one of these combined an expert evaluation with a user study. The expert evaluations all focused on satisfaction.

Satisfaction is how pleased users (or experts) are with the PRS and its recommendations. Satisfaction was normally measured through surveys which ask participants’ opinions about recommendations (perceived usefulness, so related to the utility measurements in the computational evaluations). Sometimes the usability of the PRS was measured (e.g. P51, P66, P76). One paper (P78) considered user retention: so how long they would keep using the PRS. Another measure that has been mentioned for recommender systems is the users’ trust in the system [100]. This can be measured by considering how many recommendations are followed or by asking users whether they find the recommendations reasonable. It is often hard to measure trust independently from user satisfaction, and as can be seen, the measurements taken in PRS for satisfaction are implicitly measuring trust as well.

Performance. To measure performance, the throughput or execution time was used. For instance, throughput was measured by counting the number of completed tasks per minute (P36, P78). Execution time is quite similar, but is focused on one task and how long it takes for that specific task to be selected and completed (P36).

Accuracy measurements in the user studies were all related to the relevance of recommendations, and either used the mean reciprocal rank (MRR; a variant of MAP that was used in the computational evaluations), or the rate at which users accepted recommendations, clicked on them, or consumed them (so, similar to the hit-rate in the computational evaluations, but now based on the data of users who had actually used the PRS).
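As a sketch of how MRR could be computed from user study logs, the function below averages the reciprocal rank of the first accepted package per user. The function name and the data layout (a ranked list of package identifiers per user and a set of accepted packages per user) are hypothetical.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(ranked: Dict[str, List[str]], accepted: Dict[str, Set[str]]) -> float:
    """Average over users of 1 / rank of the first accepted package (0 if none accepted)."""
    total = 0.0
    for user, ranking in ranked.items():
        for rank, package in enumerate(ranking, start=1):
            if package in accepted[user]:
                total += 1.0 / rank
                break
    return total / len(ranked)
```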

4. Conclusions

This paper presented a systematic review of PRS, which after applying inclusion and exclusion criteria looked at 79 articles in detail. This area of recommender systems’ research is still relatively immature. We note the following challenges for PRS which require future work.

4.1 Need for more open data sets that contain package ratings

Many recommender systems’ data sets have been released, but so far most are for the recommendation of individual items. Some researchers have creatively constructed package data sets from available data sets using several assumptions. For example, P39 created travel packages for users by using individual POI ratings. By combining these POI ratings and taking the POI popularity and intra-package diversity into account, a score for each package was calculated. The problem is that combining several items a user likes does not mean that the combination of those items is also appreciated by the user. Combining two items of clothing that somebody likes does not have to mean that they would like the combination of those items as an outfit.

A few researchers have produced their own package data sets through explicit user ratings in studies, but these data sets are still very limited in size and only available in some domains. For example, Wibowo et al. [11, 12] collected package ratings in the clothes domain by asking participants to rate a “top” (such as a shirt) and a “bottom” (such as pants) individually and as a package. As another example, Sharma et al., P25 [34] collected ratings for sets of movies from active users of MovieLens on movies each user had rated individually in the past. Researchers have often resorted to the use of implicit package data, but whilst this gives some insights into what users tend to consume together, and may therefore like, this is not necessarily indicative of the best possible combinations (as users may just not have been aware of other options) and often makes it hard to distinguish between packages which the user dislikes and packages which the user has just never noticed. Package data sets are not just important as a basis for recommendations, but are vital to get a reliable measurement of accuracy. Using the average of individual item ratings to produce a gold standard for the package rating is in many cases not right. For example, if a user adores a red pair of trousers and adores an orange shirt, this does not necessarily make this a great outfit. Similarly, a user may really like the British Museum and the Victoria and Albert Museum, but may be unlikely to combine them in a one-day outing. Whilst other measures (such as cohesion and diversity) may contribute to the estimation of the goodness of a package recommendation, without actual user package ratings (or other ways of gathering user opinions on packages), it is hard to reliably measure accuracy, or even to investigate the impact of diversity and cohesion on (perceived) accuracy.
Therefore, for the field of PRS to progress, the creation of large open data sets that contain both individual and package ratings is crucial. Given the reliance of most PRS systems on features for the construction of packages, such data sets also need to contain item features.

4.2 Need for more sophisticated ways of dealing with data sparsity and package cold starts

Package rating matrices are even sparser than individual item rating matrices. There has been some initial research on ways to reduce matrix sparsity [101]. Additionally, whilst individual item recommender systems often suffer from user cold start problems (difficulty to recommend to new users who have not rated anything yet) and item cold start problems (difficulty to recommend new items which have not been rated by anybody yet), PRS have a package cold start problem: the difficulty of recommending a new package. Each new item that is added could potentially lead to a very large number of new packages that contain that item. More research is needed on how to deal with data sparsity in the package rating matrix and how to deal with the package cold start problem. Given it is unlikely that this problem can be fully solved for large scale real world systems, there is a need for research on accurate estimations of package ratings based on individual item ratings, (sparse) ratings of other packages, and item features. This research will need to be done in multiple domains, as this is certainly to a large extent domain dependent.¹¹ This research will require user studies to validate the estimation formulas.

¹¹
Domain types may be distinguished, which may share certain parts of this accuracy estimation, for example the importance of package cohesion.

4.3 Need for more efficient algorithms

Package recommendation is computationally complex, with many approaches to model learning (e.g. of package preferences) and package creation being NP-hard. It is therefore important to produce more efficient algorithms, which make optimal use of heuristics (such as constraints) that are based on evidence-based insights on which item combinations go well together in a certain recommendation domain.

4.4 Need for more sophisticated metrics

Most evaluations used the same metrics as used for individual item recommender systems. There is a need for metrics specifically developed for PRS. For example, most papers used traditional accuracy metrics, whilst in a PRS, particularly when package size increases, users may decide to consume part of a package. This means that it is no longer solely a question of comparing packages consumed (as whole entities) to packages recommended, but that the content (the items in the set) of what is actually consumed needs to be compared against what has been recommended, and that in case of the package being a sequence, also the order of consumption of individual items needs to be considered. In such cases, traditional accuracy metrics (such as Precision@N which was most popular in the papers surveyed) no longer suffice. Only a few papers used Jaccard similarity and longest common subsequence to perform such a comparison that takes package content into account. Even those metrics will need more work, and will need adjusting to fully capture the complexity of PRS evaluation. For example, where in the sequence the longest common subsequence is (for example, at the start or the end) may matter for users’ perceptions of whether the recommendation was followed and useful. If the longest common subsequence in a travel package was at the start, perhaps a user really liked the package, but ran out of time (or got lost). Similarly, more work is needed on diversity and cohesion metrics. Additionally, user studies could benefit from a reliable scale to measure appreciation of packages, so that users’ opinions on package details (such as the start, finish, cohesion, diversity, serendipity) can be measured. Existing questionnaires are mainly focused on individual item recommendations, and to the best of our knowledge there is no validated scale for PRS.

4.5 Need for more comprehensive evaluations and user studies

Whilst most papers contained an evaluation, predominantly these were computational evaluations and evaluations of accuracy. Clearly, there is more to the goodness of a recommendation than accuracy, just as has been argued for single item recommender systems as discussed above. Consuming a package requires more investment by the user (in time and/or money) than consuming an individual item. Recommending packages is also more complicated than recommending individual items. Whilst most of the current studies use computational evaluations, this is not enough to understand this complex problem. Users need to be more involved in the evaluations by doing user studies. This could help to better understand how package recommendation works and what a good package is.

4.6 Need for more domains to be studied

The focus of package recommendation research has so far been mainly on the travel domain (with each other domain only studied in a couple of papers). However, there are many other domains that could be studied more extensively. This could result in a deeper and broader understanding of the package recommendation field.

References

Resnick

Iacovou

Suchak

Bergstrom

Riedl

. GroupLens: an open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM conference on Computer supported cooperative work. ACM; 1994. pp. 175-186.

Krulwich

Burkey

. Learning user information interests through extraction of semantically significant phrases. In: Proceedings of the AAAI spring symposium on machine learning in information access. Menlo Park: AAAI Press 1996. pp. 100-112.

Balabanovic

Shoham

. Fab: content-based, collaborative recommendation. Communications of the ACM. 1997; 40(3): 66-73.

Kumar

Jerbi

O’Mahony

. Towards the Recommendation of Personalised Activity Sequences in the Tourism Domain. In: RecTour 2017 2nd Workshop on Recommenders in Tourism. Como, Italy, 27 August 2017. ACM; 2017.

Yueh-Min

Tien-Chi

Wang

Hwang

. A Markov-based recommendation model for exploring the transfer of learning on the web. Journal of Educational Technology & Society. 2009; 12(2): 144.

Ricci

Rokach

Shapira

. Introduction to recommender systems handbook. In: Recommender systems handbook. Springer; 2011. pp. 1-35.

Baral

Iyengar

Balakrishnan

. CLoSe: C ontextualized Lo cation Se quence Recommender. In: Proceedings of the 12th ACM conference on recommender systems. ACM; 2018. pp. 470-474.

Dugani

Dixit

Belur

. Automated adaptive sequential recommendation of travel route. In: Computing Methodologies and Communication (ICCMC), 2017 International Conference on. IEEE; 2017. pp. 284-288.

Kitchenham

Charters

. Guidelines for performing systematic literature reviews in software engineering. 2007.

10.

Zhu

Shen

Ting

Gang

. A Package Recommendation Model Based on Credit and Time. DEStech Transactions on Computer Science and Engineering. 2017; (wcne).

11.

Wibowo

Siddharthan

Lin

Masthoff

. Matrix Factorization for Package Recommendations. In: Proceedings of the RecSys 2017 Workshop on Recommendation in Complex Scenarios (ComplexRec 2017). CEUR-WS; 2017.

12.

Wibowo

Siddharthan

Masthoff

Lin

. Incorporating Constraints into Matrix Factorization for Clothes Package Recommendation. In: Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization. ACM; 2018. pp. 111-119.

13.

Liu

Chen

Xiong

Chen

. Modeling buying motives for personalized product bundle recommendation. ACM Transactions on Knowledge Discovery from Data (TKDD). 2017; 11(3): 28.

14.

Basu Roy

Amer-Yahia

Chawla

Das

. Constructing and exploring composite items. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM; 2010. pp. 843-854.

15.

Sessoms

Anyanwu

. SkyPackage: From finding items to finding a skyline of packages on the semantic web. In: Joint International Semantic Technology Conference. Springer; 2012. pp. 49-64.

16.

Zhu

Harrington

Tang

. Bundle recommendation in ecommerce. In: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM; 2014. pp. 657-666.

17.

Beladev

Rokach

Shapira

. Recommender systems for product bundling. Knowledge-Based Systems. 2016; 111: 193-206.

18.

Iqbal

Kovac

Aryafar

. A Multimodal Recommender System for Large-scale Assortment Generation in E-commerce. arXiv preprint arXiv180611226. 2018.

19.

Parameswaran

Venetis

Garcia-Molina

. Recommendation systems with complex constraints: A course recommendation perspective. ACM Transactions on Information Systems (TOIS). 2011; 29(4): 20.

20.

Papaemmanouil

Koutrika

. CourseNavigator: interactive learning path exploration. In: Proceedings of the Third International Workshop on Exploratory Search in Databases and the Web. ACM; 2016. pp. 6-11.

21.

Suri

Gao

Xia

Börner

Liu

. Enter a job, get course recommendations. iConference 2017; Proceedings Vol 2. 2017.

22.

Xing

Van Der Schaar

. Personalized course sequence recommendations. IEEE Transactions on Signal Processing. 2016; 64(20): 5340-5352.

23.

Morsy

. Learning Course Sequencing for Course Sequence Recommendation. 2018.

24.

Pan

Zhang

. Combo-Recommendation Based on Potential Relevance of Items. In: Asia-Pacific Web Conference. Springer; 2016. pp. 505-517.

25.

Pathak

Gupta

McAuley

. Generating and personalizing bundle recommendations on Steam. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM; 2017. pp. 1073-1076.

26.

Shirai

Tsuruma

Sakurai

Oyama

Minato

. Incremental set recommendation based on class differences. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer; 2012. pp. 183-194.

27.

Mengash

Brodsky

. GCAR: A Group Composite Alternatives Recommender Based on Multi-Criteria Optimization and Voting. In: 2014 47th Hawaii International Conference on System Sciences. IEEE; 2014. pp. 1113-1121.

28.

Khabbaz Xie

. Efficient Algorithms for Recommending Top-k Items and Packages. 2011.

29.

Kouris

Varlamis

Alexandridis

. A package recommendation framework based on collaborative filtering and preference score maximization. In: International Conference on Engineering Applications of Neural Networks. Springer; 2017. pp. 477-489.

30.

Xie

Lakshmanan

Wood

. Composite recommendations: from items to packages. Frontiers of Computer Science. 2012; 6(3): 264-277.

31.

Interdonato

Romeo

Tagarelli

Karypis

. IEEE A versatile graph-based approach to package recommendation. 2013; pp. 857-864.

32.

Mamoulis

Pitoura

Tsaparas

. Recommending packages with validity constraints to groups of users. Knowledge and Information Systems. 2018; 54(2): 345-374.

33. Lauw, Wang. Mining revenue-maximizing bundling configuration. Proceedings of the VLDB Endowment. 2015; 8(5): 593-604.

34. Sharma, Harper, Karypis. Learning from Sets of Items in Recommender Systems. arXiv preprint arXiv:1904.12643. 2019.

35. Leroy, Amer-Yahia, Gaussier, Mirisaee. Building representative composite items. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM; 2015. pp. 1421-1430.

36. Ortiz, Chasi, Chalco. Clustering-Based Recommender System: Bundle Recommendation Using Matrix Factorization to Single User and User Communities. In: International Conference on Applied Human Factors and Ergonomics. Springer; 2018. pp. 330-338.

37. Kouris, Varlamis, Alexandridis, Stafylopatis. A versatile package recommendation framework aiming at preference score maximization. Evolving Systems. 2018; pp. 1-19.

38. Wang, Chen. ProductRec: Product Bundle Recommendation Based on User’s Sequential Patterns in Social Networking Service Environment. In: 2017 IEEE International Conference on Web Services (ICWS). IEEE; 2017. pp. 301-308.

39. Mengash, Brodsky. Tailoring Group Package Recommendations to Large Heterogeneous Groups Based on Multi-Criteria Optimization. In: 2016 49th Hawaii International Conference on System Sciences (HICSS). IEEE; 2016. pp. 1537-1546.

40. Amer-Yahia, Bonchi, Castillo, Feuerstein, Mendez-Diaz, Zabala. Composite retrieval of diverse and complementary bundles. IEEE Transactions on Knowledge and Data Engineering. 2014; 26(11): 2662-2675.

41. Bota, Zhou, Jose, Lalmas. Composite retrieval of heterogeneous web search. In: Proceedings of the 23rd international conference on World wide web. ACM; 2014. pp. 119-130.

42. Villavicencio, Schiaffino, Díaz Pace. Solving Package Recommendation Problems with Item Relations and Variable Size. In: Argentine Symposium on Artificial Intelligence (ASAI 2015)-JAIIO; 44 (Rosario, 2015); 2015.

43. Xie, Lakshmanan, Wood. Generating top-k packages via preference elicitation. Proceedings of the VLDB Endowment. 2014; 7(14): 1941-1952.

44. Fang, Xiao, Wang, Lan. Customized Bundle Recommendation by Association Rules of Product Categories for Online Supermarkets. In: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). IEEE; 2018. pp. 472-475.

45. Amer-Yahia, Gaussier, Leroy, Pilourdault, Borromeo, Toyama. Task composition in crowdsourcing. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2016. pp. 194-203.

46. Dorn, Skopik, Schall, Dustdar. Interaction mining and skill-dependent recommendations for multi-objective team composition. Data & Knowledge Engineering. 2011; 70(10): 866-891.

47. Castillo, Armengol, Onaindía, Sebastiá, González-Boticario, Rodríguez, et al. samap: An user-oriented adaptive system for planning tourist visits. Expert Systems with Applications. 2008; 34(2): 1318-1332.

48. Benouaret, Lenne. Recommending diverse and personalized travel packages. In: International Conference on Database and Expert Systems Applications. Springer; 2017. pp. 325-339.

49. Chen, Zhou, Tung. Automatic itinerary planning for traveling services. IEEE Transactions on Knowledge and Data Engineering. 2013; 26(3): 514-527.

50. Gionis, Lappas, Pelechrinis, Terzi. Customized tour recommendations in urban areas. In: Proceedings of the 7th ACM international conference on Web search and data mining. ACM; 2014. pp. 313-322.

51. Hti, Desarkar. Personalized Tourist Package Recommendation Using Graph Based Approach. In: Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization. ACM; 2018. pp. 257-262.

52. Reddy, Subramaniyaswamy. An enhanced travel package recommendation system based on location dependent social data. Indian Journal of Science and Technology. 2015; 8(16): 1.

53. Tan, Liu, Chen, Xiong. Object-oriented travel package recommendation. ACM Transactions on Intelligent Systems and Technology (TIST). 2014; 5(3): 43.

54. Lai, Wang. Travelbuddy: interactive travel route recommendation with a visual scene interface. In: International Conference on Multimedia Modeling. Springer; 2014. pp. 219-230.

55. Zhao, Yang, Lyu, King. STELLAR: spatial-temporal latent ranking for successive point-of-interest recommendation. In: Thirtieth AAAI conference on artificial intelligence; 2016.

56. Zhang, Liang, Wang, Sun. Personalized trip recommendation with POI availability and uncertain traveling time. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM; 2015. pp. 911-920.

57. Hsieh. Constructing trip routes with user preference from location check-in data. In: Proceedings of the 2013 ACM conference on Pervasive and ubiquitous computing adjunct publication. ACM; 2013. pp. 195-198.

58. Herzog, Massoud, Wörndl. Routeme: A mobile recommender system for personalized, multi-modal route planning. In: Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. ACM; 2017. pp. 67-75.

59. Wörndl, Hefele, Herzog. Recommending a sequence of interesting places for tourist trips. Information Technology & Tourism. 2017; 17(1): 31-54.

60. Laß, Herzog. Context-Aware Tourist Trip Recommendations. In: RecTour 2017 2nd Workshop on Recommenders in Tourism. RecTour; 2017. pp. 18-25.

61. Herzog, Wörndl. A Travel Recommender System for Combining Multiple Travel Regions to a Composite Trip. CBRecSys@ RecSys. 2014; 1245: 42-48.

62. Zhang, Liang, Wang. Trip recommendation meets real-world constraints: POI availability, diversity, and traveling time uncertainty. ACM Transactions on Information Systems (TOIS). 2016; 35(1): 5.

63. Liang, Wang. Top-k route search through submodularity modeling of recurrent POI features. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM; 2018. pp. 545-554.

64. Wörndl, Ludwig, Herzog. Recommending Customized Trips Based on the Combination of Travel Regions. In: ENTER 2015 Conference, Lugano, Switzerland, February 3–6, 2015.

65. Interdonato, Tagarelli. Personalized recommendation of points-of-interest based on multilayer local community detection. In: International Conference on Social Informatics. Springer; 2017. pp. 552-571.

66. Sang, Mei, Sun. Probabilistic sequential POIs recommendation via check-in data. In: Proceedings of the 20th international conference on advances in geographic information systems. ACM; 2012. pp. 402-405.

67. Wei, Zheng, Peng. Mining popular routes from social media. In: Multimedia Data Mining and Analytics. Springer; 2015. pp. 93-116.

68. EHC, Fang, Tseng. Integrating tourist packages and tourist attractions for personalized trip planning based on travel constraints. GeoInformatica. 2016; 20(4): 741-763.

69. Shao, Shen, Huang, Shen. Unifying multi-source social media data for personalized travel route planning. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM; 2017. pp. 893-896.

70. Ramamohanarao. A Jointly Learned Context-Aware Place of Interest Embedding for Trip Recommendations. arXiv preprint arXiv:1808.08023. 2018.

71. Yang, Zhang, Sun, Guo, Huai. A Tourist Itinerary Planning Approach Based on Ant Colony Algorithm. In: International Conference on Web-Age Information Management. Springer; 2012. pp. 399-404.

72. Hsieh, Lin. Measuring and recommending time-sensitive routes from location-based data. ACM Transactions on Intelligent Systems and Technology (TIST). 2014; 5(3): 45.

73. Jiang, Qian, Mei. Personalized travel sequence recommendation on multi-source big social media. IEEE Transactions on Big Data. 2016; 2(1): 43-56.

74. Rakesh, Jadhav, Kotov, Reddy. Probabilistic social sequential model for tour recommendation. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM; 2017. pp. 631-640.

75. Yang, Liu. Exploring personalized travel route using POIs. International Journal of Computer Theory and Engineering. 2015; 7(2): 126.

76. Liu, Chen, Xiong. A cocktail approach for travel package recommendation. IEEE Transactions on Knowledge and Data Engineering. 2012; 26(2): 278-293.

77. Baral, Zhu. CAPS: Context Aware Personalized POI Sequence Recommender System. arXiv preprint arXiv:1803.01245. 2018.

78. Yang, Guo. Personalized travel package with multi-point-of-interest recommendation based on crowdsourced user footprints. IEEE Transactions on Human-Machine Systems. 2015; 46(1): 151-158.

79. Baral, Iyengar, Zhu. HiCaPS: hierarchical contextual POI sequence recommender. In: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM; 2018. pp. 436-439.

80. Biwas Creed. Itinerary Planning Using Top-k Package Recommendation and Multiple Constraints.

81. Chen, Zhang, Xing. A package generation and recommendation framework based on travelogues. In: 2015 IEEE 39th Annual Computer Software and Applications Conference. vol. 2. IEEE; 2015. pp. 692-701.

82. Boulakbech, Cheniki, Messai, Sam, Devogele. Linked Data Graphs for Semantic Data Integration in the CART System. In: International Conference on Web Engineering. Springer; 2018. pp. 221-226.

83. Xiong, Liu. A situation information integrated personalized travel package recommendation approach based on TD-LDA model. In: 2015 International Conference on Behavioral, Economic and Socio-cultural Computing (BESC). IEEE; 2015. pp. 32-37.

84. Chang, Tsai. ATIPS: automatic travel itinerary planning system for domestic areas. Computational Intelligence and Neuroscience. 2016; 2016: 1.

85. Mikhailov, Kashevnik. An Ontology for Service Semantic Interoperability in the Smartphone-Based Tourist Trip Planning System. In: 2018 23rd Conference of Open Innovations Association (FRUCT). IEEE; 2018. pp. 240-245.

86. Alsayasneh, Amer-Yahia, Gaussier, Leroy, Pilourdault, Borromeo, et al. Personalized and diverse task composition in crowdsourcing. IEEE Transactions on Knowledge and Data Engineering. 2017; 30(1): 128-141.

87. Chen, Zhang, Guo, Pan. TripPlanner: Personalized trip planning leveraging heterogeneous crowdsourced digital footprints. IEEE Transactions on Intelligent Transportation Systems. 2014; 16(3): 1259-1273.

88. Jeffries, Brodsky. Composite Alternative Pareto Optimal Recommendation System with Individual Utility Extraction (CAPORS-IUX). In: ICEIS (1); 2018. pp. 328-335.

89. Tintarev, Masthoff. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction. 2012 Oct; 22(4): 399-439. Available from: 10.1007/s11257-011-9117-5.

90. Shapiro. Optimal pricing of experience goods. The Bell Journal of Economics. 1983; pp. 497-507.

91. Cho, Fjermestad, Roxanne Hiltz. The impact of product category on customer dissatisfaction in cyberspace. Business Process Management Journal. 2003; 9(5): 635-651.

92. Laband. An objective measure of search versus experience goods. Economic Inquiry. 1991; 29(3): 497-509.

93. Murphy, Enis. Classifying products strategically. Journal of Marketing. 1986; 50(3): 24-42.

94. Tintarev, Masthoff. Over- and underestimation in different product domains. Workshop on Recommender Systems associated with ECAI; 2008. pp. 14-19.

95. Baccigalupo, Plaza. Case-based sequential ordering of songs for playlist recommendation. In: European Conference on Case-Based Reasoning. Springer; 2006. pp. 286-300.

96. Liu, Rauterberg. Music playlist recommendation based on user heartbeat and music preference. In: 2009 International Conference on Computer Technology and Development. vol. 1. IEEE; 2009. pp. 545-549.

97. Liebman, Saar-Tsechansky, Stone. DJ-MC: A reinforcement-learning agent for music playlist recommendation. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems; 2015. pp. 591-599.

98. Bobadilla, Ortega, Hernando, Gutiérrez. Recommender systems survey. Knowledge-Based Systems. 2013; 46: 109-132.

99. Cremonesi, Koren, Turrin. Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the fourth ACM conference on Recommender systems. ACM; 2010. pp. 39-46.

100. Shani, Gunawardana. Evaluating recommendation systems. In: Recommender systems handbook. Springer; 2011. pp. 257-297.

101. Wibowo. Generating Pseudotransactions for Improving Sparse Matrix Factorization. In: Proceedings of the 10th ACM Conference on Recommender Systems. RecSys ’16. New York, NY, USA: ACM; 2016. pp. 439-442. Available from: 10.1145/2959100.2959107.

Table 1 Search sources

Table 3 Article references for articles selected in the systematic literature review

Table 5 Package types in systematic review articles

Table 7 Only individual vs also package input

Table 8 PRS Phase 1: Model learning

1 Ant Colony Optimization iteratively randomizes the items included in a package and evaluates the package's estimated value (such as travel distance), improving the solutions through random changes over several iterations.

2 Whilst many travel PRS produce more complicated packages containing multiple points of interest, some travel PRS just combine flights and hotels, so the domain categories we used are not necessarily a reflection of package complexity.

3 Some articles used more than one form of evaluation, so the number of evaluation forms can be higher than the total number of articles.

Table 11 Evaluation methods and metrics in systematic review articles

11 Domain types may be distinguished, which may share certain parts of this accuracy estimation, for example the importance of package cohesion.
