Abstract
With technology in abundance, digital natives expect comfort and minimal effort. They expect an instant, correct and crisp response to their queries. Keeping this in mind, this paper proposes an agent, HELP, which fetches unstructured data from the web, aligns it with the structured knowledge base by transforming it into rules, and updates the CLIPS knowledge base progressively. This overcomes the issue of format non-uniformity. The proposed model is simulated for a technical education university. The objective is to provide an interface for newcomers who are unfamiliar with the rules, regulations, and policies of the new environment and, at the same time, reluctant to pose questions to seniors or faculty.
Introduction
Today everyone is driven by convenience. The daily use of abundant technology has changed the way digital natives think and interact. It has further changed their expectations. Now, they look for effortless engagement and immediate responses at any moment in their “round-the-clock” lifestyle [1]. Moreover, instead of reading a lengthy document, they are more interested in concise, to-the-point answers to their queries. The information that a person seeks in an environment already exists in the form of text, either offline or online. The only problem is retrieving crisp and correct information on demand. There arises a need for a smart and intelligent agent which can search for the correct information easily and produce it flawlessly. A robust knowledge base is a core requirement for such an agent. Such systems can be employed as an inquiry system or question answering system (QAS) in a large variety of domains such as schools, universities, organizations, customer care centers, hospitals, railways, airports, banks, and entertainment.
Question answering systems (QAS) offer a good solution to these problems. Existing QAS answer queries either from text [1, 2, 3] or from a knowledge base [4, 5, 6]. While knowledge base (KB) methods are effective at answering compositional questions, their performance is often affected by the incompleteness of the KB. Moreover, they require knowledge engineering with the help of domain experts, which is costly and time-consuming. A great deal of information is available online in textual form. However, it is highly unstructured and cannot be directly employed to answer queries. So, another challenge is knowledge extraction from unstructured text [7]. Thus, there is a strong need to align the structured knowledge base with the unstructured text in a common space. It is important to take the knowledge base into account when answering queries because of its generative capacity. At the same time, the unstructured text is important for keeping the model updated with continuously changing and rapidly increasing data.
However, building a QAS that combines a structured knowledge base with unstructured text is challenging owing to the structural non-uniformity. Different researchers [8, 9, 10, 11, 12, 13, 14] address this problem partially by aligning text patterns with the KB. But the rich and ambiguous nature of language allows a fact to be expressed in many different forms, which these models fail to capture.
In this paper, we propose an agent, HELP, that fetches unstructured data from the web, aligns it with the structured knowledge base by transforming it into rules, and updates the CLIPS KB dynamically. This overcomes the issue of format non-uniformity. Moreover, a common inference engine is used to answer all kinds of queries, as the unstructured data are also represented in the form of rules and facts. The proposed model is simulated for a technical education university. The objective is to present an interface for newcomers who are unfamiliar with the rules, regulations, and policies of the new environment and, at the same time, reluctant to pose questions to seniors or faculty. The advantages of such a system are that an expert is not needed all the time, up-to-date information is available, and queries receive quick answers.
The paper is structured as follows. First, similar work in the same domain is discussed, followed by a detailed discussion of the proposed model and its architecture. Then, the implementation and results are discussed. Finally, the paper concludes with limitations and directions for future work.
Related work
In recent years, researchers have proposed several methods and algorithms in various domains [1, 2, 15, 16, 17, 18]. One such system presents a structure for obtaining complex relations from codes and texts in the form of rules using natural language processing (NLP) tools and text matching methods. The system works with existing knowledge and free texts. The knowledge is represented in Web Ontology Language (OWL), whereas the free text is unstructured. The unstructured input is first formatted and its grammatical categories are then identified using the Stanford Parser. The model is simulated for gynecology, a medical field. The limitation of the system is that, in order to extract rules, the knowledge base must be an OWL ontology, as Semantic Web Rule Language (SWRL) works only with OWL [2]. Another researcher developed a learning engine for testing real-time data. The author compared different association-based algorithms and then proposed the best-suited algorithm for generating rules from the data. The association rule algorithms compared were Apriori, Predictive Apriori, FPGrowth and Tertius. The Tertius algorithm was found to be the most suitable, as it generated valid rules in the shortest time. The algorithms were tested on real-time data by implementing the learning engine in an NTPC power plant. The demerit of the learning engine is that the information used for testing is already structured and the knowledge base must be strictly in if-then form [1]. In another work, the authors used supervised machine learning to detect short messages sent by mobile malware on the basis of characteristics derived from the content of those messages.
The authors compared the detection abilities of various machine learning algorithms: support vector machines (SVM), k-nearest neighbors (KNN), decision trees, random forests and multinomial naive Bayes. The algorithms were compared in three separate cases. In the first case, all short messages are treated as independent of each other, while in the second case, half of the messages are used as the training dataset. In the last case, the classifier is trained on the dataset up to a particular point in time and tested on the subsequent messages. The authors conclude that all the machine learning (ML) techniques perform extremely well, with an average accuracy of 98%. Random forest (RF) outperforms the other algorithms with an accuracy of 99.36% [15].
Various kinds of work have also been performed on the Hindi language for classification and simplification purposes. One such work presents a hybrid approach for identifying sentiment-bearing texts or phrases in Hindi text and categorizing them as positive, negative or neutral on the basis of their polarity. A dataset of 1000 sentences was prepared by collecting resources from websites, blogs, discussion forums, etc. The phases of the model include text pre-processing, part-of-speech (POS) tagging, and a hybrid approach that combines a rule-based model with a statistical model. The proposed classification model achieves 70% accuracy. The limitation of the method is that the experiment was conducted on a small dataset in a specific domain [16]. Another work on the Hindi language provides an improved annotation scheme for indirect anaphora in Hindi based on the Emille corpus. The methodology consists of four parts. First, a Hindi corpus was selected, followed by identification of the characteristics that define indirect anaphora. After that, the proposal was validated using ML techniques, and then a classification system for indirect anaphora was designed. The drawback of this model is that the author was not able to produce the required results due to a lack of suitable rules. The small size of the dataset was another limitation of the work [17]. Along similar lines, another work focused on the problem of simplifying complex sentences into multiple simple sentences using linguistic resources for the Hindi language. The linguistic resources used comprise a verb demand frame and a list of conjuncts. For testing, both human and automated evaluation were performed. Although the method achieved a good score during evaluation, it cannot handle complex predicates, as they are generative in nature [18].
Proposed model
Given the need for an agent that can respond instantly to a query without requiring much effort from the user, we propose an agent, HELP, that combines the knowledge base and unstructured text to process the query. Initially, it has an underlying knowledge base (KB) developed by a domain expert [4]. Further, it extracts information from the web to keep its knowledge up-to-date, then transforms the data into facts and (if-then) rules, and aligns them with the CLIPS KB to maintain format uniformity. Afterward, the CLIPS inference mechanism, along with a backward chaining algorithm (implemented in Java [4]), is used to respond to the user’s query. This way the agent, HELP, keeps its KB up-to-date without the constant intervention of an expert. Moreover, all information (structured or unstructured) is aligned in a common space, which eliminates the problem of format non-uniformity and hence allows the same inference mechanism to deal with all types of information. Figure 1 displays the architecture of the proposed model.
Figure 1. Architecture of the proposed model, HELP.
Here, the raw data are crawled from relevant online sources, then preprocessed and forwarded to the Classifier module. This module decides the category of each sentence. After categorization, the sentences are passed to the Knowledge Extractor. This module transforms the sentences into facts and rules. This structured information is then added to the knowledge base in the form of rules for further inferencing.
The crawler crawls unstructured or semi-structured data from the web and passes it to the preprocessor. The preprocessor decomposes the complete text into sentences and tags each word with the appropriate part of speech. It also tags each sentence as simple, complex or compound. A simple sentence is commonly composed of a subject and a predicate. Sometimes, it can also have an object. A complex sentence is composed of several simple sentences which are dependent on each other. Two or more simple sentences combined into a single sentence with a coordinating conjunction form a compound sentence. Table 1 shows a few examples of the different sentence categories.
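The preprocessing step described above can be illustrated with a minimal Java sketch. It is not the implementation used in HELP: the class `SentenceTagger`, its method names and both word lists are assumptions made for illustration, and the category is guessed from connecting words alone, whereas the preprocessor in this paper also relies on part-of-speech tags. Sentence splitting uses the standard `java.text.BreakIterator`.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of HELP: splits text into sentences and
// guesses the sentence category from connecting words only.
class SentenceTagger {
    // Assumed word lists; a real preprocessor would use POS tags or a parser.
    private static final Set<String> SUBORDINATORS = new HashSet<>(Arrays.asList(
            "if", "when", "because", "although", "while", "unless", "since"));
    private static final Set<String> COORDINATORS = new HashSet<>(Arrays.asList(
            "and", "but", "or", "so", "yet"));

    // Decompose a block of text into trimmed, non-empty sentences.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Crude heuristic: a subordinating word marks a complex sentence,
    // a coordinating conjunction marks a compound one, otherwise simple.
    static String category(String sentence) {
        String[] words = sentence.toLowerCase(Locale.ENGLISH).split("[^a-z]+");
        boolean hasCoordinator = false;
        for (String w : words) {
            if (SUBORDINATORS.contains(w)) {
                return "complex";
            }
            if (COORDINATORS.contains(w)) {
                hasCoordinator = true;
            }
        }
        return hasCoordinator ? "compound" : "simple";
    }
}
```

A word-list heuristic of this kind mislabels simple sentences that merely contain a list joined by “and”, which is one reason a POS-based approach is preferable in practice.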
Examples of sentence category
Analysis of classifiers
Confusion matrix of multinomial naive bayes classifier
Examples of knowledge extraction and its structuring
Examples of knowledge extraction and its structuring in technical education university
The classifier module plays an important role in updating the knowledge base with new facts and rules. Upon investigation, it is found that the queries generally asked by users fall into four classes: WHAT-IS, WHERE, WHAT-HAPPEN and WHEN. This classification helps in understanding whether a sentence should be represented as a fact or a rule, and how. Sentences belonging to the “WHAT-IS” or “WHERE” query class are simple atomic sentences and can be used directly for inferencing. Hence, these sentences should be stored as facts in the knowledge base. Sentences belonging to the “WHAT-HAPPEN” or “WHEN” class are compound or complex sentences and should be stored as (if-then) rules in the knowledge base. The challenging task is to identify the query class of any sentence. A classifier is required for this task. We compared Multinomial Naïve Bayes, random forest, and SVM with linear and non-linear kernels to select the best classifier for the proposed model. Section 4 discusses the pros and cons of all these classifiers for the proposed model. Based upon the analysis, we selected the Multinomial Naïve Bayes classifier [19] for query-class identification. The Multinomial Naïve Bayes classifier estimates the conditional probability of a particular word given a class as the relative frequency of term
Here,
For training the model, two question answering (QA) datasets are used: SQuAD [20] and WikiQA [21]. SQuAD is a widely used QA dataset comprising comprehension passages and question-answer pairs. WikiQA is a QA dataset containing questions with their answers as short sentences. Both datasets contain questions and answers. However, the proposed model needs only each sentence and its query class. Thus, the datasets are pre-processed to extract only these two attributes. They are further filtered to keep the four specified query classes (WHAT-IS, WHERE, WHAT-HAPPEN and WHEN). After preprocessing, 1019 samples are retrieved. The model is trained on 939 of these samples. The trained model is then tested on the remaining 80 samples and used to determine the query class of sentences crawled from the web.
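To make the query-class identification step concrete, the following is a minimal Java sketch of a multinomial naive Bayes classifier over sentences. It is an illustration under stated assumptions, not the classifier used in the experiments: the class `QueryClassNaiveBayes`, the whitespace-and-punctuation tokenizer, and the add-one (Laplace) smoothing are all choices made here, and the training sentences in the usage note are invented rather than taken from SQuAD or WikiQA.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a multinomial naive Bayes query-class classifier.
class QueryClassNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Assumed tokenizer: lower-case, split on anything that is not a letter or digit.
    private static String[] tokenize(String sentence) {
        return sentence.toLowerCase(Locale.ENGLISH).trim().split("[^a-z0-9]+");
    }

    // Record one labelled sentence: update class and per-class term counts.
    void train(String sentence, String queryClass) {
        totalDocs++;
        docsPerClass.merge(queryClass, 1, Integer::sum);
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(queryClass, k -> new HashMap<>());
        for (String term : tokenize(sentence)) {
            if (term.isEmpty()) {
                continue;
            }
            counts.merge(term, 1, Integer::sum);
            tokensPerClass.merge(queryClass, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    // Return the class with the highest log prior plus summed log likelihoods.
    String predict(String sentence) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : docsPerClass.keySet()) {
            double score = Math.log(docsPerClass.get(c) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(c);
            int classTokens = tokensPerClass.getOrDefault(c, 0);
            for (String term : tokenize(sentence)) {
                if (term.isEmpty()) {
                    continue;
                }
                int n = counts.getOrDefault(term, 0);
                // Relative term frequency within the class, with add-one smoothing (assumption).
                score += Math.log((n + 1.0) / (classTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

After training on a handful of labelled sentences, for example `train("The library is located in Block A", "WHERE")`, a call to `predict` returns the most probable of the classes seen so far. Working in log space avoids numerical underflow when sentences are long.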
Once the query class is known, the sentence needs to be transformed and stored as knowledge in the knowledge base so that it can later be used by the inference mechanism. Here, transformation means converting the unstructured text into CLIPS-compatible facts and (if-then) rules with the help of POS tags. Upon analysis, it is found that sentences belonging to the same class exhibit some common patterns. Sentences categorized in the “WHAT-IS” query class are generally composed of two components: a subject and information related to the subject. These two components can be arranged in either way: subject
Sentences belonging to the “WHAT-HAPPEN” query class are mainly complex or compound sentences. They either consist of two sub-sentences or are made up of several simple sentences joined with the help of conjunctions or other connecting words. These kinds of sentences should be represented as if-then rules. The next challenge is to identify the antecedent and consequent parts of the rule from the given sentence. For example, in the sentence “If you keep a Goldfish in the dark room, it will eventually turn white.”, the first half of the sentence is the antecedent and the second half is the consequent. Whereas, in the sentence, “Butterflies can only fly when their temperature is above 27
else if question class = WHEN
    first noun ← getFirstNoun(s)
    verb ← getVerb(s)
    time ← getNumber(s)
    structure_info ← createStructure(first noun, verb, time)
else if question class = WHAT-HAPPEN
    antecedent sentence, consequent sentence ← lineSplit(s)
    first noun ← getFirstNoun(antecedent sentence)
    verb ← getVerb(antecedent sentence)
    second noun ← getSecondNoun(antecedent sentence)
    antecedent ← createStructure(first noun, verb, second noun)
    first noun ← getFirstNoun(consequent sentence)
    verb ← getVerb(consequent sentence)
    second noun ← getSecondNoun(consequent sentence)
    consequent ← createStructure(first noun, verb, second noun)
    structure_info ← createRule(antecedent, consequent)
end if
Two types of sentences can be categorized in the “WHEN” query class: sentences associated with a time and sentences associated with a situation. In this paper, we focus mainly on the first category. Sentences of the “WHEN” query class are composed of three components: a noun, a verb, and a number or a phrase containing numbers. For example, in the sentence “India won the world cup in 2011”, “India” is a noun, “won” is a verb and “2011” is a number. It is observed that sentences of this category mostly have the word “in” followed by a number. These sentences contain atomic knowledge and can be stored as facts. Based on these observations, we propose an algorithm for knowledge extraction and its structuring.
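The time-related WHEN pattern described above (a noun, a verb, and a number introduced by “in”) can be sketched in Java with a regular expression. This is an illustrative assumption rather than the extraction code of HELP: the class `WhenFactBuilder` is hypothetical, it treats the first word as the noun and the second as the verb instead of consulting POS tags, and the `(verb noun time)` layout of the resulting CLIPS ordered fact is chosen here for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: turns a time-related WHEN sentence into a CLIPS ordered fact.
class WhenFactBuilder {
    // Assumed shape: "<noun> <verb> ... in <number>", with optional trailing punctuation.
    private static final Pattern WHEN_PATTERN =
            Pattern.compile("^(\\w+)\\s+(\\w+)\\b.*\\bin\\s+(\\d+)\\W*$");

    // Returns a fact such as "(won india 2011)", or null if the pattern does not apply.
    static String toClipsFact(String sentence) {
        Matcher m = WHEN_PATTERN.matcher(sentence.trim());
        if (!m.matches()) {
            return null;
        }
        String noun = m.group(1).toLowerCase(Locale.ENGLISH);
        String verb = m.group(2).toLowerCase(Locale.ENGLISH);
        String time = m.group(3);
        // Assumed fact layout: relation name first, then subject and time.
        return "(" + verb + " " + noun + " " + time + ")";
    }
}
```

For the example sentence “India won the world cup in 2011”, this sketch yields `(won india 2011)`; sentences without “in” followed by a number are rejected and would need a different handler.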
Confusion matrix of technical education university dataset
In this algorithm, the input is a sentence and its query class. First, the sentence is preprocessed to remove special characters and to create symmetry between the words present in the sentence. Then, the sentence is decomposed according to its query class. The resulting components are passed as parameters to the createStructure() function, which returns a structured string representing either a fact or a rule. At the end, all the structured information is added to the knowledge base.
The implementation is done using Java with CLIPS. The dataset consists of two attributes, i.e. the sentence and its query class. After training, the model is tested on the remaining 80 samples of the preprocessed dataset, with 20 sentences from each category.
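As an illustration of the rule branch of this algorithm, the sketch below converts a sentence of the form “If X, Y” into a CLIPS `defrule` string. It is a simplification under stated assumptions: the class `WhatHappenRuleBuilder` and its methods are hypothetical, the sentence is split at the first comma, and each clause is kept whole as an ordered fact instead of being reduced to the noun-verb-noun structure that createStructure() produces. Only the `defrule`, `=>` and `assert` constructs are standard CLIPS syntax.

```java
import java.util.Locale;

// Hypothetical sketch: builds a CLIPS rule from an "If <antecedent>, <consequent>" sentence.
class WhatHappenRuleBuilder {
    // Clean a clause (lower-case, strip special characters) and wrap it as an ordered fact.
    static String toFact(String clause) {
        String cleaned = clause.toLowerCase(Locale.ENGLISH)
                .replaceAll("[^a-z0-9\\s]", " ")
                .trim()
                .replaceAll("\\s+", " ");
        return "(" + cleaned + ")";
    }

    // Returns a defrule string, or null when the sentence is not of the assumed form.
    static String toClipsRule(String ruleName, String sentence) {
        String s = sentence.trim();
        if (!s.toLowerCase(Locale.ENGLISH).startsWith("if ")) {
            return null;
        }
        int comma = s.indexOf(',');
        if (comma < 0) {
            return null;
        }
        String antecedent = s.substring(3, comma);   // text after "If " up to the comma
        String consequent = s.substring(comma + 1);  // text after the comma
        return "(defrule " + ruleName + " " + toFact(antecedent)
                + " => (assert " + toFact(consequent) + "))";
    }
}
```

Applied to the goldfish sentence used earlier, with the rule name `goldfish-rule` chosen by the caller, the sketch produces `(defrule goldfish-rule (you keep a goldfish in the dark room) => (assert (it will eventually turn white)))`. Sentences in which the condition follows the consequence, such as those joined by “when”, would need a separate splitting rule.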
To begin with, an experiment is conducted to select the most appropriate classifier for the proposed model. Here, the simulation is carried out using Multinomial Naive Bayes (MNB), random forest (RF), SVM with linear kernel (SVM
It is observed that time complexity of training process for RF and SVM
The trained model is then used with the knowledge extraction module. This module uses the trained classifier to map the text into facts and rules. Table 4 shows excerpts of the results.
Simulation for a technical education university
The designed framework is also simulated for a technical education university. For this, the data are crawled from its website. The crawled information is preprocessed and classified into query classes using the already built classifier module. Note that, because we are working with the query class only, the dataset and the classifier module designed earlier can be used in any domain to determine the query class of any English sentence. The knowledge is then extracted and fed to the knowledge base. Table 5 shows excerpts of the results. For example, the sentence “Student failing in a subject will be awarded F grade” is first cleaned and classified in the WHAT-HAPPEN query class. Then, it is converted into a CLIPS-compatible rule format for use in further inferencing. The model is tested on 403 sentences from this domain and achieves an accuracy of 94.04%. Table 6 shows the corresponding confusion matrix.
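For reference, accuracy is obtained from a confusion matrix as the sum of the diagonal entries divided by the total number of samples. The Java sketch below shows this calculation together with per-class recall; the class `ConfusionMatrixStats` is hypothetical, and the matrix in the usage note is invented for illustration and does not reproduce the values of Table 6.

```java
// Hypothetical helper: summary statistics from a square confusion matrix
// whose rows are actual classes and whose columns are predicted classes.
class ConfusionMatrixStats {
    // Accuracy = correctly classified samples (diagonal) / all samples.
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Recall of one class = diagonal entry / sum of that class's row.
    static double recall(int[][] matrix, int cls) {
        long rowTotal = 0;
        for (int value : matrix[cls]) {
            rowTotal += value;
        }
        return rowTotal == 0 ? 0.0 : (double) matrix[cls][cls] / rowTotal;
    }
}
```

With an invented four-class matrix such as `{{9, 1, 0, 0}, {0, 10, 0, 0}, {1, 1, 8, 0}, {0, 0, 1, 9}}`, the diagonal sums to 36 out of 40 samples, so `accuracy` returns 0.9. By the same arithmetic, an accuracy of 94.04% on 403 sentences is consistent with 379 correct classifications, since 379/403 ≈ 0.9404.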
Conclusion
The technologically savvy lifestyle of modern digital natives demands effortless engagement and comfort. Keeping this in mind, we proposed an agent, HELP. It begins by generating a knowledge base through knowledge acquisition. The data collection module acquires knowledge by crawling information from numerous online resources. Then, this module performs preprocessing to segregate the acquired information into the smallest meaningful units, i.e. sentences. These sentences are then tagged with a query class using a Multinomial Naive Bayes classifier. After tagging, the knowledge extractor converts the raw sentences into CLIPS-compatible facts and rules. This overcomes the issue of format non-uniformity. Moreover, a common inference engine is used to answer all kinds of queries, as the unstructured data are also represented in the form of rules and facts. The proposed model is simulated for the preprocessed dataset and the technical education university dataset. It achieved accuracies of 96.25% and 94.04%, respectively.
The proposed model can be employed as an inquiry system or question answering system in a large variety of domains such as schools, universities, organizations, customer care centers, hospitals, railways, airports, banks, and entertainment. Further, the model can be extended to cope with more query classes. It can also be extended to handle WHEN-class sentences associated with situations. Moreover, future work can improve the scalability of the model.
