Abstract
Proliferation of information is a major confront faced by e-commerce industry. To ease the customers from this information proliferation, Recommender Systems (RS) were introduced. To improve the computational time of a RS for large scale data, the process of recommendation can be implemented on a scalable, fault tolerant and a distributed processing framework. This paper proposes a Content-Based RS implemented on scalable, fault tolerant and distributed framework of Hadoop Map Reduce. To generate recommendations with improved computational time, the proposed technique of Map Reduce Content-Based Recommendation (MRCBR) is implemented using Hadoop Map Reduce which follows the traditional process of content-based recommendation. MRCBR technique comprises of user profiling and document feature extraction which uses the vector space model followed by computing similarity to generate recommendation for the target user. Recommendations generated for the target user is a set of Top N documents. The proposed technique of recommendation is executed on a cluster of Hadoop and is tested for News dataset. News items are collected using RSS feeds and are stored in MongoDB. Computational time of MRCBR is evaluated with a Speedup factor and performance is evaluated with the standard evaluation metric of Precision, Recall and F-Measure.
Introduction
Rapid development in the arena of information technology has revolutionized the world by providing access to information at the click of a button. Information on the Web is growing at a massive rate and is expected to further grow exponentially with time. This phenomenon is known as information overload. When a user accesses Web to satisfy his query, he is exposed to a vast pool of data in various varieties and struggles to find relevant answers to his query. Researchers are making a continuous effort to help the user with this problem of information overload by developing some personalized system.
Recommender Systems (RS) forms a major solution to the problem of information overload. RS are special system software which assists the user in the struggle he faces in making a correct choice from huge varied list of available options. They present the user with relevant content from the vast pool of information. Information is considered relevant if it satisfies the user. RS broadly fall into two categories as non-personalized or personalized RS. Non-personalized recommendations do not consider user’s behavior and likes. Recommendations generated by these RS include top selling items in case of item recommendation, mostly played tracks in case of song recommendation, top news articles in case of news recommendation etc. Personalized recommendations works on the principle of analyzing the user’s behavior and generating recommendations which well suits his likes and tastes. They filter and display the content taking into account the active user’s preferences and likes. RS are widely used in various applications which involve generating recommendations from a large amount of variety data. With the increasing information at disposal, developing scalable, computationally efficient and fault tolerant RS is still an area of research.
The issue of scalability, fault tolerance and improvement in the computational time for a RS can be addressed by developing a RS on a platform which is capable of processing data in a distributed mode, supporting master slave architecture where the data can be replicated on multiple slave nodes to ensure fault tolerance. A framework which complies with these characteristics is the Apache Hadoop [1]. Apache Hadoop enables fault tolerant, distributed processing of data and ensures scalability by executing the applications on a cluster of nodes. To handle large amount of variety data a schemaless database is needed. A NoSQL database is capable of handling structured, semi structured and unstructured data generated on Web due to its schema less property and scalability of data via sharding [9].
In this work we propose a RS implemented on Hadoop and NoSQL database to ensure computational efficiency in generating recommendations for large dataset, to handle scalability of data and to create a fault tolerant RS in addition to accuracy. The proposed recommender system is a personalized RS based on the Content-based recommendation (CBR) technique. The proposed RS implements traditional CBR as Map Reduce Content-Based Recommendation (MRCBR) on a distributed framework Apache Hadoop using Map Reduce programming paradigm. The proposed MRCBR technique comprises of Map Reduce based Vector Space Model (MRVSM) – to create user profile and to extract item features using Map Reduce paradigm. This technique is detailed in [8]. Map Reduce based Cosine Similarity (MRCS) –to retrieve items based on similarity between the user profile and item features. MRCS implements cosine similarity measure using Map Reduce.
Recommendations generated for the target user is a list of Top N items. Our work evaluates the performance of MRCBR and CBR using the standard evaluation metric of Precision, Recall and F-Measure [12]. Scalability of the proposed technique is evaluated using Speedup factor [14]. Structure of rest of the paper is as follows: Section 2 presents the related work on RS, user profiling, Vector Space Model and distributed RS. Introduction of the methodologies used in the proposed system is discussed in Section 3. Section 4 elaborates the framework of the proposed system. Experiments and results are presented in Section 5 followed by conclusion in Section 6.
Related work
Recommender Systems have become a major area of research in the field of Information Retrieval (IR) and Artificial Intelligence. Researchers working in this domain are posed with various challenges. One of the challenges is to improve the accuracy and the computational time for generating the recommendations when provided with large dataset. A lot of work has been done which addresses different issues and challenges. This section presents the literature review concerned with the subject.
Recommender system and techniques
A variety of techniques are available for generating recommendations for the target user. Plethora of literature using different techniques of recommendation exists. These techniques are -Collaborative Filtering based recommendation (CF), Content-based recommendation (CBR), Knowledge-based recommendation and Hybrid recommendation. Jannach in his work [13] explained all the techniques of recommendations in detail. Pazzani et al. [15] discussed the content-based recommendation (CBR) implemented on a non parallelized framework. Exhaustive literature exists on RS which uses the above mentioned recommendation techniques. To discuss all of them is beyond the scope of this paper. However [13, 15] highlighted the basic methodology of CBR which is used by the proposed system. To summarize, the process of Content-based recommendation involves creating user profiles, feature extraction of the items followed by generating recommendations for the target user using a similarity measure between the items features and the target user profile. The proposed technique of MRCBR uses Content-based recommendation implemented on a distributed computing platform Hadoop. Existing literature states diverse ways to create user profiles and to extract features from the items. The proposed technique uses the concept of Vector Space Model (VSM) [5] for the same. The paper next describes the existing state of literature on user profiling and VSM.
Vector Space Model (VSM) and user profiling
Various techniques for creating user profiles and item feature extraction exist in literature. This section highlights some of the related work for creating user profiles on a distributed framework and also discusses the use of VSM for different purposes.
Turney et al. [24] identified three categories of VSMs as word-context, pair-pattern and term-document. A personalized TV program filtering system was proposed by Zhi-Wen et al. [28]. The authors used VSM and created vectors on a dimensional space which represented users and programs. The weighting scheme for creating term vector was not specified in their work. Relevant programs were then filtered by computing similarity between the user vector and program vector. Map Reduce implementation for profiling users is also supported by the literature. Scheuer et al. [22] employed Kullback-Leibler method for user profiling. In their work, user profiles were represented as textual content. Proposed system was implemented on Hadoop Map Reduce. Gautam et al. [8] created user profile using the concept of VSM and implemented the same on Hadoop using Map Reduce paradigm.
Distributed recommender system
Research in RS addresses developing distributed RS using different recommendation techniques. It was found from the literature that the researchers have used Hadoop and other technologies to develop a distributed scalable RS. Han et al. [11] proposed a distributed CF algorithm. The proposed algorithm of PipeCF was implemented on a Peer-to-Peer structure to construct a scalable and distributed RS. A personalized News recommender system using the CF technique which catered to a large community of users was developed by Das et al. [6]. Recommendations were generated by means of MinHash clustering, PLSI implemented on Map Reduce and co visitation count. The proposed RS was evaluated using precision and recall. To support scalable RS some researchers have made an attempt to build a RS using Map Reduce programming paradigm. Jiang et al. [14] proposed a parallelized item-based CF on Map Reduce. The proposed algorithm was evaluated using performance metric of speedup and isoefficiency. Results showed that a good scalability is achieved by the proposed algorithm. Refs. [3, 29] used collaborative filteringrecommendation technique using Map Reduce implemented on Hadoop. Vinodhini et al. [25] developed a RS by collecting the ratings from the users and identifying the features of the items. Hadoop Map Reduce is used in their work to identify the features of the items by using word count method. Pessemier et al. [16] developed a content-based recommender system on Hadoop Map Reduce. They implemented the standard technique of content-based recommendation on Wikipedia articles using Map Reduce. Content characterization of the articles is done using Map Reduce implementation of TF-IDF followed by a similarity measure and finally generating recommendations for the user. Dooms et al. [7] proposed a distributed content-based RS to generate recommendations in a reasonable time, having a large data at disposal. The work implemented an in-memory content-based algorithm for recommendation without using Map Reduce, which can be distributed across machines to improve the computational time of the system by introducing data parallelism. Research on Hadoop framework gives enough evidence for its appropriateness for large scale data processing [4].
Contribution
Literature survey highlighted that researchers have used Hadoop Map Reduce technique for implementing a RS. Implementing RS using CF is addressed by a large community of researchers without using Map Reduce [11] and with using Map Reduce [3, 29]. However, this work is an attempt to implement a CBRS for large scale data using Hadoop Map Reduce by using functional parallelism as opposed to data parallelism [7]. Map Reduce implementation of word count is used [25] to extract the item features while our proposed work uses Map Reduce based technique of TF-IDF [17] for feature extraction over the raw frequency [19]. A CBRS was developed using Hadoop Map Reduce [16, 21]. Authors in [16] did not create user profile when using content-based recommendation which is a major limitation of their proposed work. Also the proposed framework is made to run on Hadoop with a single node. The proposed system supports user profiling in addition to the content characterization of the documents, followed by computing similarity between them and recommending documents to the users. To demonstrate the improvement in computational time for a large dataset, MRCBR is made to run on a Hadoop cluster with 5 nodes. Literature implementing a distributed RS evaluates the proposed system on the basis of computational time required to generate recommendations and omits the need to evaluate the generated recommendations [7, 29]. Accuracy and precision are used to evaluate the RS [6, 25]. The proposed RS is evaluated using the standard metric of Precision, Recall and F-Measure to test the recommendation quality in addition to the computational time.
Introduction to methodologies used in the proposed system
Recommender system (RS)
Advent of e-commerce has led to massive information. To show the relevant content to user, RS plays an important role. Four techniques of recommendations are Content-based Recommendation, Collaborative Filtering based Recommendation, Knowledge based recommendation and Hybrid Recommendation which are discussed below. Of these the proposed system uses content-based recommendation which is discussed at length.
Content-based recommendation (CBR)
In CBR each item which is considered for recommendation is referred as a document [13]. The RS which uses CBR tends to identify the features of the items or documents to create profiles of the user based on the identified document feature. To recommend documents to the user, it does not require a large community of users who have used the system. The document descriptions assist the recommendation process. Depending on the application domain these document descriptions can vary. A new document is recommended to the user by identifying its features rather than finding community of users with same interest. CBR addresses the problem of item cold start. In a nutshell CBR aims at finding unseen documents with similar features to the ones which the user has already seen and liked in the past.
The idea of CBRS is to analyze the history of a user to find out documents the user has liked or rated in the past. Based on those documents the system then tries to recommend him similar unseen documents. Content provider (which could be a web page, RSS feeds, Item descriptions, documents etc.) acts as a data source to populate the system with documents which are stored in data repository. The next step is to represent documents in the form of features or attributes. A popular technique of Vector Space Model [19] is used for this purpose. CBR exploits the history of the user to recommend documents. To analyze user history, user profile is created which is expressed in terms of features extracted from the documents liked or rated by the user in the past. To recommend relevant and yet unseen documents from a large set of documents, document filtering is done.
The process of document filtering is done by computing the similarity of the user profiles with the document features. The system then recommends top N documents to the target user. For all the recommended documents a feedback mechanism is used to learn about the relevance of the recommended documents and to update the user profile. The proposed system uses this technique of recommendation to recommend relevant documents to the target user.
Apache Hadoop
Emergence of internet and e-commerce has led to data deluge. To process and store this voluminous data, traditional relational databases provide an incompetent solution. This requires a novel model to store and process the growing data with the capability to process data in parallel across multiple machines, to work in master slave architecture. To meet this requirement for voluminous data Apache Hadoop and Map Reduce [2] forms the key ingredients.
Hadoop
Hadoop is an open source framework by Apache Software Foundation, intended to work on distributed systems. Hadoop can be set up in a standalone as well as in a cluster configuration. Building blocks of this framework are a file system which stores the data known as Hadoop Distributed File System (HDFS) and Map Reduce engine for processing the stored data in HDFS.
Hadoop Distributed File System (HDFS)
HDFS is a file system capable of storing voluminous data on the nodes. For a Hadoop cluster it stores data distributed across the nodes of the cluster. A file to be stored in HDFS is divided into blocks which are stored on separate nodes. The default size of each block is 64 MB. HDFS comprises of Namenode, Secondary Namenode and Datanode to store and manage data. Namenode comprises of the metadata for directories and files residing in the file system. Datanode stores the actual data by storing the directory blocks, file mappings and namespace tree. The request of accessing data in a cluster is addressed by HDFS client by sending a request to the Namenode which returns a list of Datanodes where all the blocks of a file are residing. The property of fault tolerance of the system is assured by replicating each block of a file is on various Datanodes in the cluster. Secondary Namenode helps to recover from Namenode failure by keeping a copy of the namespace image which can then be used in case of Namenode failure.
Map Reduce
HDFS, a file system within the Hadoop framework possess the capability of distributed storage of data. Processing this distributed data residing across the cluster demands for a parallel processing programming paradigm which is catered by Map Reduce supported by Hadoop. Map Reduce comprises of Jobtracker and Tasktracker to process the data in master slave architecture as is shown in Fig. 1. The processing of each task in Hadoop is done in two phases namely- Map and Reduce.
In the Hadoop master-slave architecture, master executes Jobtracker which coordinates and controls all other slaves by running Tasktracker to accomplish the assigned task. Master node processes the input data by dividing it in fixed size units identified as spilt. Each split is processed on the slave node by executing the map task. The input to the map function is in the form of <key, value>pair which processes it to generate the intermediate output in the same form. The intermediate output is written out to the local disk which is given as input to the reducer for further processing. Reducer processes the input by generating a <key, value>output pair. To summarize, in a Hadoop cluster there is a master node which comprises of Namenode and Secondary Namenode from the HDFS and JobTracker from the Map Reduce engine as shown in Fig. 1. All the slave nodes in the cluster have Datanode and TaskTracker to store data and execute operations supporting parallel data storage and parallel data processingrespectively.
Proposed work
The proposed technique Map Reduce based Content Based Recommendation (MRCBR) employs content-based recommendation technique implemented on Hadoop Map Reduce framework to recommend documents from a large set. The proposed technique is scalable, fault tolerant and supports distributed processing of dataset.
Framework of the proposed system is illustrated in Fig. 2. Content provider provides the documents from the Web which is stored in the NoSQL database by the administrator. The web data sources contribute data in the form of structured, semi structured and in unstructured form. The system uses MongoDB to store this variety data. Recommendations for target user are generated by MRCBR which is deployed on Hadoop cluster of varying size. To generate recommendations, data from MongoDB is loaded into HDFS to enable Map Reduce tasks on the data.
The technique of MRCBR comprises of – feature extraction, user profiling and document filtering as depicted in Fig. 2. Distributed processing is achieved by implementing the technique of MRCBR using Map Reduce framework and deploying it on Apache Hadoop. To generate recommendations for the target user, MRCBR implements two Map Reduce based techniques firstly Map Reduce based Vector Space Model (MRVSM) to create user profile and to extract item features using Map Reduce paradigm. Secondly Map Reduce based Cosine Similarity (MRCS) to find the similarity between the user profile and item features. The prototype of proposed system is developed and tested for news items which are gathered from RSS feeds for different news channels. News items being textual can be expressed as a collection of words or terms which are considered as feature set for a news item. To store the semi structured and voluminous news items the system uses a NoSQL database MongoDB [9].
MRVSM approach for feature extraction and user profiling
Feature extraction process involves identifying relevant set of keywords which represents a document [13]. Documents being rich in text are not readily available for processing thus placing a need to pre-process them. Documents are pre-processed using the standard technique of stop word filtering and stemming. Each document contains some words which are connectors, prepositions and are frequently occurring in the text, without extending their contribution hence not so strong to represent. These words are called Stop words. It includes a, an, like, the etc. Stop word filtering is done to give more importance to terms which represents a document strongly while the process of stemming involves reducing variation of a term which generally has analogous semantic explanation to the root word or to the stem. The proposed system uses Porter Stemmer [27] for stemming the documents. The pre-processed documents are then used for feature extraction.
MRCBR makes use of the Map Reduce based Vector Space Model (MRVSM) for document feature extraction and also for creating user profiles. MRVSM uses Term Frequency-Inverse Document Frequency (TF-IDF) for weighting the terms within a document to create a term vector. For each document (input for feature extraction) in MongoDB and document read by the user (input for user profiling), MRVSM computes TF-IDF. TF-IDF associates a value of significance with each term in a document, to create a term vector for it. The algorithm for MRVSM is detailed in [8]. To create term vector with same dimension for each document, the system first defines dimension size for each vector by creating a dictionary from all the documents present in the repository. This dictionary is a set of all the terms in the document space. After defining the dimensions of the vector, TF-IDF for each term ‘t’ in the document squon′i ∈ N is computed as the product of Equations (1, 2) where N is the set of documents in the repository.
MRVSM for a document ni is expressed in Equation (3).
In order to create user profile for a user, N is the set of all documents read or seen by the user.
To recommend documents to the user based on user’s interest MRCBR requires filtering documents (based on target user’s profile) from the entire corpus of documents. In the proposed system a user’s profile is expressed in terms of keywords or terms vector as in Equation (3). Also the entire corpus of documents is represented in terms of term vectors. To recommend relevant documents to the user, a similarity matching between the two term vectors is required. A most popular and common similarity measure is the Cosine Similarity [13]. Cosine Similarity is mathematically expressed as:
Value of lies between 0 and 1. If for two documents, value of this measure is close to 1 it indicates a strong similarity between them and value close to 0 indicates dissimilar documents. MRCBR uses Map Reduce based Cosine Similarity (MRCS) to find pair wise cosine similarity. Algorithm for MRCS is stated in Algorithm 1. The final list of recommendation is generated by identifying Top N documents on the basis of system defined threshold.
The previous sub-sections explained and stated the algorithm for the proposed technique. Figure 3 shows the data flow of the proposed system to generate recommendations for the target user. In Fig. 3 arrows labeled “a.” depicts the data flow for MRVSM i.e. for user profiling and for document feature extraction. Arrows labeled “b.” indicates the data flow for MRCS. For extracting features from the documents, documents collected from content provider (source of semi structured or unstructured data) and are stored in MongoDB. This input data in the case of user profiling is the user‘s interaction with our system which is captured from the web portal and is stored in MongoDB. The data from MongoDB is then exported into HDFS. Before importing data in HDFS to eventually execute Map Reduce tasks on it, some data transformation is needed. To accomplish the task of feature extraction, user profiling and finally generating recommendations using Hadoop Map Reduce, data from plain textual format is required to be transformed into a format supported by Map Reduce operations. Map Reduce processes data represented in the form of <key, value>pair. To achieve this data transformation, the proposed system converts the input data files to a sequence file format [26] which represents data in a binary, <key, value>pair. After converting the input data into a sequence file, it is then loaded into HDFS to execute the desired MR tasks of feature extraction, user profiling and set of document recommendations. This section gave a detailed insight of the recommendation process which the proposed system follows. Next section investigates the performance of the proposed MRCBR technique.
Experimental results and analysis
Experiments are conducted to substantiate the performance and improvement in the computational time for the proposed technique of MRCBR. The experiments were conducted on News dataset. Performance of the system is evaluated using the Precision, Recall and F-Measure and the computational time is evaluated using Speedup factor. In the subsequent subsections, Section 5.1 describes the dataset used for experimentation. Section 5.2 details the experimental setup followed by Section 5.3 which defines the evaluation metrics for the experiment. Results of the experiment are discussed in Section 5.4.
Dataset specification
Experimental study for the proposed technique is conducted on News dataset. News items are collected from different news channels in the form of RSS feeds which exhibits the property of being semi structured (in the form of XML) in nature. For each news item- title, description, RSS feed name, link and date are stored under newsdata collection in MongoDB [9]. Newsdata collection comprises of 48K news items. Details of each user are stored in users collection in MongoDB.
Experimental setup
The proposed system is developed on Map Reduce paradigm which is executed on a Hadoop cluster. For experimentation Hadoop 0.20.2 is installed on Ubuntu 12.04 in single node and multi-node configuration with a cluster of 5 nodes [2]. A NoSQL database MongoDB 2.4.12 is used for storing the news items [9].
To evaluate the performance of the proposed system, the experiments are conducted for three cases which are described as:
Case A: Evaluating recommendations using Text Search feature of MongoDB.
MongoDB supports the feature of text search which enables searching string content (search term) in the documents within a collection. The process of text search is initiated by first pre-processing the text using tokenization and stemming followed by generating a score for every document where a match for the search term occurs. This score is interpreted as the relevance of the document for the search term. The search term can be a single term or a phrase. Text search in MongoDB is disabled by default. Enabling this feature is initiated by first creating a text index on the field to carry out text search on. After creating a text index on the field, the text command is used along with the search terms as its parameter. Our work implements this feature of MongoDB by creating a text index on the title field in the newsdata collection to retrieve all the news items which are similar to the news items read by the target user in the past by using title as the search term. It then computes the similarity between the news items viewed by user with all the news items stored in “newsdata” collection. The output of this command is a set of news items along with a score associated with each of them. A set of top N unseen news items is taken into consideration for recommendation.
Case B: Evaluating recommendations for a CBRS (a non-parallelized approach).
In this case an implementation of a RS based on non-parallelized approach CBRS is evaluated for accuracy.
Case C: Evaluating recommendations for MRCBR.
For the proposed system this paper evaluates the accuracy and the computational time it takes for the system to generate recommendations by varying the number of nodes in the Hadoop cluster.
Evaluation metrics
The evaluation of MRCBR and CBR in terms of accuracy is computed using the standard metrics in the field of information retrieval – Precision, Recall [12] and F-Measure. Precision and Recall are expressed as Precision@N and Recall@N where Top N documents are considered for recommendation. The computational time for MRCBR is evaluated using a Speedup factor [14, 29].
Precision is defined as the ratio of relevant documents selected to the number of documents selected. It is computed using Equation (5). Precision is interpreted as the probability that the selected item is relevant.
Recall is the fraction of relevant items to the total number of relevant items at disposal. It is computed as in Equation (6). Recall is interpreted that the relevant item will be selected.
F-Measure is defined using precision and recall as shown in Equation (7).
To evaluate the computational time for the Case C described in Section 5.2, a Speedup factor is computed which is defined using Equation (8)
Experiments were conducted for the Cases A, B and C defined in Section 5.2 to evaluate Precision, Recall and F-Measure. Speedup factor is computed for the case C to evaluate scalability of MRCBR in addition to the computational time.
Results
Table 1 shows the results of Precision, Recall and F-Measure for various values of N for the cases A, B and C. Table 2 records the computational time for MRCBR to generate recommendations with varying number of nodes in the Hadoop cluster and news items available in the system. Speedup factor for MRCBR was computed and is tabulated in Table 3.
Discussions
Results from Table 1 and Fig. 4 depict the accuracy of recommendations for the three cases discussed in Section 5.2. It is inferred from the results that the accuracy is at par for all the three cases for the same value of N which varies from 5 to 25. However to predict the recommendations with a good level of accuracy it is important to select the right threshold.
Table 2 and Fig. 5(a) illustrates the computational time it takes for MRCBR to generate recommendations by varying the dataset size and the number of nodes in the Hadoop cluster. It is observed that for fixed dataset size the computational time of MRCBR decreases with the increase in the size of the cluster, inferring that Hadoop works efficiently for large number of data items. Also while maintaining constant cluster size and increasing the dataset significantly the computational time did not increase radically.
Table 3 and Fig. 5(b) demonstrate the speedup factor for MRCBR. It is observed that the speedup factor increases significantly with the increasing number of nodes in the Hadoop cluster for 12K, 24K and 48K dataset size thus maintaining the scalability of MRCBR.
Conclusion
This paper presented a new approach which addressed the problem of improving the computational time for generating recommendations from a large scale data by proposing a CBRS implemented on a distributed platform-Hadoop. The proposed technique of MRCBR was implemented on Map Reduce paradigm which enabled parallel processing of data.
The proposed technique MRCBR comprised of two techniques namely Map Reduce based Vector Space Model (MRVSM) for creating user profiles and document feature extraction and Map Reduce based Cosine Similarity (MRCS) which computed the cosine similarity between the user profile and document features. The prototype of the system was implemented on News dataset. Results showed that accuracy of recommendations from CBR and MRCBR were at par. It was experimentally shown that the proposed technique of MRCBR supported scalability as the speed up factor increased with the increase in the size of the Hadoop cluster. The computational time is also improved for MRCBR by increasing the number of nodes in the Hadoop cluster for a fixed size dataset.
Footnotes
Acknowledgments
The authors duly acknowledge University Grants Commission (UGC) for funding this research work via UGC MRP Grant No. [42-139/2013 (SR)] and UGC Junior Research Fellowship (JRF) Ref No.: 3492/(NET-DEC. 2012).
