Abstract
Extraction of knowledge data from knowledge database using natural language query is a difficult task. Different types of natural language processing (NLP) techniques have been developed to handle this knowledge data extraction task. This paper proposes an automated query-response model termed Extended Automated Knowledge Provider System (EAKPS) that can manage various types of natural language queries from user. The EAKPS uses combination based technique and it can handle assertive, interrogative, imperative, compound and complex type query sentences. The algorithm of EAKPS generates structure query language (SQL) for each natural language query to extract knowledge data from the knowledge database resident within the EAKPS. Extraction of noun or noun phrases is another issue in natural language query processing. Most of the times, determiner, preposition and conjunction are prefixed to a noun or noun phrase and it is difficult to identify the noun/noun phrase with prefix during query processing. The proposed system is able to identify these prefixes and extract exact noun or noun phrases from natural language queries without any manual intervention.
Keywords
Introduction
A number of Natural Language processing (NLP) techniques have been developed to reduce the gap between computer and human interaction. The NLP has become one of the most active interaction area of computer and humans. NLP enables communication between computer and human without any complex procedure as in [1]. NLP based interfaces are used in interaction process where computers are able to understand and manipulate natural language text or speech. Natural Language Interface to Databases (NLDBI) systems have been developed by researchers since five decades for various languages and different types of underlying databases. NLDBI systems extract desired information from databases after processing the natural language query as in [2]. Knowledge extraction, Text summarization, Information retrieval, Information extraction and domain specific applications are few examples of
natural language text processing. Lexical analysis, Syntactic analysis (Parsing), Semantic analysis, Disclosure Integration and Pragmatic analysis are five general steps defined in the methodology of NLP text processing. In general, lexicon-based [3] analyzer plays an important role in sentiment analysis that involves to calculate sentiment polarity using word dictionary annotated with sentiment score as in [4]. The structure of words in a sentence is identified by the lexicon analyzer. Lexicon analyzer becomes active in division of whole chunk of text into words, sentence or paragraphs. Each word is analyzed in the sentence for grammar that shows the relationship among the words in a sentence. This analysis process is called syntactic analysis or parsing. The syntactic analysis is automated text processing that is based on grammar and vocabulary where each word is assigned as in [5]. The quality of syntactic analysis depends on grammar and vocabulary but lack of adequate structural model of syntactic analysis may not interpret the phrases if a word in the phrase does not feature in the dictionary. Absence of a word in dictionary makes syntactic analysis impossible as in [6]. The exact text meaning or dictionary meaning from text is checked for meaningfulness by semantic analyzer. Semantics is related to text understanding and this is a very important task in semantic analysis. Many researchers have devoted their efforts to this field. As an example, Named entity recognition (NER) [7] discovers named entities in a text and classifies them into categories. The topic models for “latent topics” reorganization refers to text to probabilistic distributions on words. The activity of Entity linking [8, 9] is on retrieving “explicit topics” that is stated as probabilistic distributions on an entire knowledge base [10]. Discourse integration in NLP refers to the meaning of individual sentences which depends on the previous sentences that precede it. The structure of the sentence is reinterpreted and the meaning that was originally stated is interpreted. This part of the analysis is called Pragmatic analysis. NLP techniques are used to extract knowledge from Knowledge Management Systems (KMS). Protégé and NeOn Toolkit are the most popular open source tools that are designed for supporting active participation among knowledge management’s experts [11]. Thus, research and development activities have been focused on creating NLP based interfaces in KMS to overcome language barriers. Previously, two NLP based KMS have been elaborated in [12, 13, 14]. The NLP based KMS is domain specific and it has been developed on Request-Response model. Permutation-Combination (PC) and Grammatical Rule (GR) based KMS have been developed to process the natural language queries and provide response to the client side. PC based KMS termed Knowledge Provider System (KPS) is semi-automated and GR based KMS termed Automated Knowledge Provider System (AKPS) is automated request-response model [12, 13] respectively. Both systems have some limitations in processing natural language queries. KPS can handle assertive and interrogative type simple queries. KPS extracts noun and verb from query sentence but is not able to extract noun phrases. Extraction of noun phrases refers to chunking of nouns which is a concept of deep parsing. The drawback of KPS is that it does not deal with deep parsing; neither can it handle compound or complex type sentences. AKPS has the ability to extract single noun or noun phrases. This technique has been developed with the concept of deep parsing. But AKPS depends on grammar rules that are predefined. The predefined rules are used at the time of PoS tagging. Maximum number of grammar rules should be maintained to achieve better parsing performance. The limitation of this AKPS is that it cannot extract noun(s) from complex and compound sentences. The proposed system termed Extended Automated Knowledge Provider System (EAKPS) has been developed to handle assertive, interrogative, imperative, compound and complex type query sentences. The EAKPS is an automated query-response model capable to extract noun or noun phrases with prefix like determiner, preposition and conjunction. EAKPS works on creation of structure query language (SQL) from a natural language query to generate response.
Related works
Natural Language processing is a technique that is applied for human-computer interaction. Various useful NLP based applications have been developed in past few decades. Speech recognition is another added advantages in NLP based system. Speech and Text both can be utilized in NLP based system to extract information. A natural language query to SQL conversion has been proposed by the Authors [15]. The proposed system is able to handle complex type queries. This system has been developed for the Placement cell officer who is the novice user. This system is also equipped with the speech recognition as in [15]. Database plays an important role in computer field and this is a major source of information. Maximum IT applications store and retrieve information from database. There are various interfaces like form based, natural language and keyword based. Knowledge of database language like SQL is essential for data retrieval from database but common people (other than SQL experts) do not write SQL query. Therefore a user friendly interface of natural language to database (NLDBI) and Keyword Based Interface to Database (KBIDB) have been identified by the researcher. The proposed work is based on natural language and keyword based interface to database (NLKBIDB) [16]. The architecture of NLKBI DB has been developed which provides solution for syntactically correct and incorrect natural language input query as referred [16]. The partial experiment of Lexical Analyzer and Key word based interface has been done on agriculture survey database. 53% of syntactically incorrect query has been solved by this system as in [16]. Dashboard is integration, validation, and visualization natural language processing (NLP) based tool which provides infrastructural facilities. The facilities include evaluation of individual NLP modules and combination of multiple NLP modules to build a large end user NLP system. This tool is used by the system integration team to integrate and validate the NLP system. The interface of Dashboard helps the developers to profile (time memory) each module. Dashboard tool is used by the researcher to evaluate and compare their module with the earlier versions of same module. The heterogeneous platforms are used for execution of NLP based modules and NLP based distributed modules by the Dashboard tool. The graphical interface of dashboard is easy to use for the user and it is uses Eclipse RCP that allows user to perform the task better as in [17]. UNL is a collaborative framework as well as application of natural language processing. This frame work encourages and promotes the participation of linguists and non-linguists in the development of an integral natural language processing workbench. The UNL workbench has user friendly back-end and front-end applications that assist to learn UNL basics. This proposed system has the ability to analyze natural language into their abstract semantic meanings and finding the common denominator between all languages as in [18]. Another NLP based system has been proposed that accept natural language question as input. This system is able to generate SQL from the natural language question and extracts information from the database. The railways reservation database has been used for the extraction. Tokenization, Lemmatization, POS Tagging, Parsing and Mapping have been used to generate the SQL from natural language queries. The proposed system uses the regular expression for mapping the queries as in [19]. Efficiency of information retrieval from structured database through natural language provides high usability and utility. Extraction of relevant information from structured database from limited number of query words or sometimes a single word is the challenge in information retrieval. The correct interpretation, disambiguation and context resolution is needed in natural language query processing. The proposed work has been done on Student Interface System by querying user in Natural Language (English). The natural language query sentence is semantically parsed and transformed into intermediate query form. Finally, this model converts intermediate query into the structured query language and extracts information from the structured database as in [20]. Chat robot is specifically tailored for providing FAQ Bot system for university students. The chat robot accepts natural language as an input, and then processes it through the information repository and sends response in natural language. The Information repository has been modelled by a connected graph where each node contains information and information nodes are linked internally. The AIML (Artificial Intelligence Markup Language) has been used for authoring the Information repository. Information repository has been separated from the natural language interface component by the chat robot design. Correspondingly, a pure dialog system (associated with natural language knowledge data based entries), a domain knowledge system (engineered with information content) and a hybrid system (combination of dialog and domain knowledge) have been developed for experimentation. Therefore, Information repository can be easily modified and directed on particular topic without recreating the code design. Topic specific dialog with conversational knowledge is required the maximum dialogue session than the general conversational dialog as per the suggestion of experimental parameters and outcome as in [21]. The main challenge of natural language interface to database (NLIDB) is reducing the gap between natural language query and data from the database. Two things are very common in this challenge and that are keyword mapping and path inference. Keyword mapping is related to match natural language keywords with entity, attribute or value of the database. Another thought is join path inference that generates SQL using selected relation and join condition. But, the authors [22] have proposed leveraging information from the SQL log that will enhance the performance of the existing NLDBI systems as in [22]. The natural language query to SQL generation is a main purpose for any NLDBI for non-technical user. Most of the NLDBI are question answering based interface to database. Basically, natural language questions are factoid questions and that will convert to the complex type SQL query which is an open research problem. Authors [23] have proposed machine learning based system that will help to the existing NLDBI system for complex SQL generation. This system is not rule based and it generates multiple factoid sub-queries from complex type queries. The sub-queries can be handled by the NLDBI easily to generate responses from the database. The responses of the sub-queries and desired has been used for model training. A multiclass classifier has been used to generate the responses from valid input queries as in [23]. Annotation of text with tag represents the desired output of the computational linguistic and natural language processing automation task. The annotation tags are provided for training, validation and evaluation. Part of speech and gloss tags are used for Arabic morphological analysis in Arabic computational linguistics and natural language processing. Several manual and automated tagging tools have been developed for text but very few of them are based on Arabic morphological analysis. An open source tagging tool with visual interface has been developed that enables the construction of annotated Arabic text corpora with automatic morphology-based tags. The specification of tags with Boolean formulae is authorized by this open source tagging tool where the atomic predicates are matched and relations are contained between the morphological solution of part of the text and morphological feature value. Entry of manual tags edit of existing tags through a tag sensitive colouring interface, comparison of tag sets and computation of accuracy results are provided by this tool at the user end as in [24]. Natural language statement to SQL conversion using RNN auto-encoder and decoder are another useful work that has been given in [25]. The RNN auto-encoder has been used in this system which is a main component for online human language translation. Structure information has been extracted by the Decoder in training data input. Decoder correlates structure of the output with input and generates SQL statements as in [25]. The proposed work of [26] is based on the design of Natural Language Interface for Web based information retrieval. The natural language query is main source to retrieve the documents from the web. The proposed model reads natural language query and converts to more formal representation such as First-Order Predicate Logic structure. First-Order Predicate Logic structure is easy to manipulate for computer programs. This proposed work provides an interactive interface for the search engine that will enhance the performance for information retrieval as in [26]. Sem-QAS is a present system to explore semantically annotated data for end user in job search domain. A natural language text query is translated into SPARQL by the technique of Sem-QAS. It semantically identifies distinct atomic filtering constraints and their semantic association s from the input text query. Complex SPARQL queries are combination of the triple patterns which are generated for atomic filtering constraints. This Complex SPARQL queries are formed dynamically by the Sem-QAS. A special attention has been given to the processing of scope modifiers and association operators by this system for high recall and accuracy. Mooney Job data set and queries collected from a real job search engine have been used for the efficiency and the correctness of Sem-QAS as in [27]. In this section, maximum systems are NLP based system that are handling natural language text or question. Natural language to SQL conversion is an open research problem in natural language processing. Our proposed work is not keyword [16, 22] based, it extracts noun or noun phrases from the natural language query. It uses noun or noun phrases to represents the semantic for SQL generation. Few semantic based systems [18, 20] have been given in this section. NLP steps like Lemmatization, POS tagging, parsing and mapping have been used in [19, 24]. The proposed system is using tokenization, POS tagging, parsing, semantic representation steps without any third party tool like POS tagger, NLTK tool or any WordNet database.
Architecture of EAKPS
The proposed system is based on the automated query-response model. This system has been extended from AKPS [13]. Extraction of explicit knowledge from the default database is a primary work of this system. This system is NLP based and it generates SQL query statement from NLP query. The context diagram of EAKPS has been given in Fig. 1 where the Client will login into the system first. The Client will post a query in natural language (NL). This system will read the NL query from user and start its validation. After which POS tagging is applied and this generates a unique word or list of unique words which are not mentioned in part of speech table. The system will use these unique words to create substring. Each substring will form combination of words. The repetition of combination(s) may occur many times and the system will remove them wherever applicable. EAKPS will read each combination and apply semantic analysis where each combination is matched with words in synonyms database after removing repetition of combination(s) step. Each combination will be searched in Synonyms database for matching to represent the semantic table that has been discussed in 3.1 Algorithm Section. The SQL will be created from this semantic table. System will process the SQL to generate response from the default knowledge database. If the request data are not available in the default knowledge database, then the request is stored in the temporary database and shall have to wait for the update of the default knowledge database for this unmanaged request termed the pending request. When the default database is updated for this pending request, the EKPS system automatically generates the response of this pending request to the client.
Architectural diagram of EAKPS.
Basic Algorithmic Steps of EAKPS:
NL query reads and validation. POS tagging. Substring formation. Create combination. Remove repetition of combination(s) wherever applicable. Semantic analysis. Create SQL. Generate response from SQL.
Step 1: NL query reads and validation
The proposed system reads the query in natural language (NL). The natural language query splits the query into tokens (individual words). The proposed system uses the parts of speech (POS) list for validation of the query. The POS list does not contain the noun, verb, adverb, and adjective. These are termed as unique tokens. The first step of validation checks whether unique tokens are generated and a natural language query should contain at least a single unique token. If number of unique token is less than 1, it means the query does not contain any unique token and then proposed system will stop the process, else proposed system will validate the query.
Step substring formation: NL query array.
Step Substring formation: NL query array with blankspace.
Step substring formation: Sub-string array.
Step 2: Pos tagging
The parts of speech list contains WH words, Auxiliary Verbs, Determiners, Pronouns, Prepositions and Conjunctions. These are initialized in string array (WH [ ], AUX [ ], DET [ ], PRO [ ], PRE [ ], CON [ ]). The natural language query splits into another string array and starts to match each word from string array with each words from WH [ ], AUX [ ], DET [ ], PRO [ ], PRE [ ], CON [ ] array. If the words match, then remove the match words and store blank space on this particular word position of NL query array.
Pseudo Code:
//The user submitted sentence are
tokenized and stored in string
array.
(1) Read Input Statement S.
(2) Split Input Statement S into
the String Array called Words[n].
(a) Declare Parts of Speech List:
WH[n] = {"what", "which"...}
AUX[n] = {"is", "am", "are"...}
PRE[n] = {"of", "to", "in", "for"
...}
DET[n] = {"a", "an", "the"...}
PRO[n] = {"I", "you,"he"....}
....
Words[n] = str1.Split();
FOR i = 0 to i < word.Length
FOR j = 0; to j < Wh.Length
IF(words[i] == Wh[j]) THEN
words[i] = ",";
END IF
END FOR
END FOR
FOR j = 0 to j < Aux.Length
IF(words[i] == Aux[j]) THEN
Search_Aux[k] = words[i];
words[i] = ",";
k is incremented by 1
END IF
END FOR
FOR j = 0 to j < BVerb.Length
IF(words[i] == BVerb[j]) THEN
Search_bverb[m] = words[i];
words[i] = ",";
m is incremented by 1
END IF
END FOR
FOR j = 0 to j < Pronoun.Length
IF(words[i] == Pronoun[j]) THEN
words[i] = ",";
END IF
END FOR
FOR j = 0 to j < Determiners.Length
IF(words[i] == Determiners[j]) THEN
words[i] = ",";
END IF
END FOR
FOR j = 0 to j < Conjunction.Length
IF(words[i] == Conjunction[j]) THEN
words[i] = ",";
END IF
END FOR
FOR j = 0 to j < Preposition.Length
IF(words[i] == Preposition[j]) THEN
words[i] = ",";
END IF
END FOR
Step 3: Substring formation
The NL query array contains blank spaces and unique words those do not match with parts of speech list after the step of POS tagging. An example has been given below step bystep.
NL query – “What are the course names offered by the Andhra University?” The NL query splits into the NL query array has been given in Fig. 2. This is the NL query array where each word is stored as per formation of the sentence. The matching words will be removed from the NL query array after parsing. The blank space will be stored at the position of removed words in the NL query array (Fig. 3). The NL query array (Fig. 4) contains unique words “course”, “names”, “offered”, “Andhra”, “University” and “.” that refers to the blank space after POS tagging. All the adjacent words that exist before blank spaces are copied as an unit to sub-string array and all the adjacent words in NL query array that exist after blank space(s) are also copied as an adjacent unit to sub-string array leaving out all blank spaces in NL query array. The Sub-string array has been given in Fig. 4. The “course”, “names” and “offered” words are added sequentially with blank space and form a string like “course names offered”. That string will be placed in the 0
Step 4: Create combination(s)
The proposed system will read the string from the sub-string array one by one. The system will create combination of “course names offered” string at first. Each combination will be stored in Combination table (Table 1). After completion of the combination of “course names offered” string, the system will read another string for creating combination from sub-string array. Again the system will store each combination to the Combination table. This process will be done by the system while the sub-string array value is not equal to “Null”. The combinations of “course names offered” string and “Andhra University” have been given in Table 1.
Combination table of sub-string
Step 5: Remove repetition of combination(s) wherever applicable
The proposed system will create many combinations on selected string. The repetition of combination in Combination table may occur after completing of all combinations on all selected string from sub-string array. This system will check the repetition of each combination in the Combination table. If the system will detect any repetition of combination then detected repetition will be removed from the Combination table. Each combination will be unique in the Combination table.
Pseudo Code:
//Create Combination and Remove
Repetition of Combinations
wherever applicable
FOR i = 0, j = 0 to j < subs1.Length
IF(subs1[j] != ",") THEN
subs2[i] = subs2[i] + " " +subs1[j];
ELSE
i is incremented by 1
END IF
FOR i = 0 to i < subs2.Length
K = 0;
IF(subs2[i] != null) THEN
words = subs2[i].Split();
FOR j = 0 to j < word.Length
1 = words[j].CompareTo(" ");
IF(i < 0) THEN
Do nothing;
ELSE
copywords[k] = words[j];
k is incremented by 1
END IF
END FOR
CALL GetCombination (copywords, k)
Method
GetCombination (string[] list,
int m) Method
Int i, j, k, vcount;
Semantic sm1 = new Semantic();
String calc = null;
Double count = Math.Pow(2, m);
FOR i = 0 to i <= count
String str = Convert.ToString(i,
2).PadLeft(m, ’);
FOR j = 0 to j < str.Length
IF (str[j] == ’) THEN
calc = calc + list[j] + " ";
END IF
END FOR
sm1.InsertCombination(calc);
//Call Insert Combination Method.
calc = null;
END FOR
END GetCombination
....
Semantic table
Step 6: Semantic analysis
The system reads each combination from the Combination table. This system uses three types of search for matching each combination. Entity Search from Entity table of Synonyms database, Attributes search from Attributes table of Synonyms database and value search from all tables of the entire default database. The Entity table contains the entity name and synonyms words of entity name with Primary key (PK), Foreign key (FK) and Candidate key (CK) of Default database. The Attributes table contains entity name, Attributes name, synonyms words of Attributes name with Primary key (PK), Foreign key (FK) and Candidate key (CK) of Default database. The proposed system starts to match each combination with each value in Entity table and Attributes table of the synonyms database. If combination matches with any value of Entity table or Attributes table then entire row will be selected in particular table. The proposed system starts to read each value from selected rows of Entity table or Attributes table and inserts them in Semantic table except synonym value. If the Combination matches with any value in any table in the default database then the column name of the corresponding table is selected and the attributes table of the synonyms database is searched again corresponding to the column name wherein the entire row values with corresponding columns name are selected. Again the system will insert the value into the Semantic database. If the combination does not match with any value in default database then proposed system will not be able to generate SQL and it will return error message to the client side. The semantic table (Table 2) has been given below as per the above example.
Step 7: Create SQL
The proposed system will create the structure Query Language (SQL) from Semantic table. The Semantic Table has the detailed information about the Entity name, Attribute name, PK (Primary Key), FK (Foreign Key), CK (Candidate Key) and Value. The Entity column contains “Course” and “Universities”. Attrib column contains “Course_Name” and “University_Name”. The PK, FK and CK hold the keys which are defined in the default database. The Value column contains a single value “Andhra University”. The proposed system will read first value from Entity column and start to check repetition of selected value in Entity column. When this process will run then system will select the first value from Attrib column and start to check repetition. Again, the system will do the same with other column up to Value column. The nested looping concept has been applied to construct the SQL. The SQL has been given below from the above Semantic table. Pseudo code from semantic steps to SQL generation step has been discussed in [13].
SELECT Course_Names, University_Name FROM Course, Universities WHERE Universities.U_Id
Step 8: Generate response from SQL
The proposed system will execute the SQL and extract knowledge data from the default database. If the system fails to generate response for any query then it will be stored in temporary database as pending request. The default database will be updated from many sources. Database Administrator, System Administrator or any other Administrator may take a part to update the default database. The system will generate response on pending request after completing update of default database.
The proposed system processes the natural language query and generates response from the default database. The default database is domain specific database. Three databases have been used as default database in this proposed system. These databases have been implemented on E-University information services, E-entertainment information services and E-health information services.
Database of E-University information services
The detail of E-University information database has been given below. The detail figure and tables have been given in [13].
Entity table – Entity table contains entity name with its synonym words. The attributes of this proposed entity table has been given below.
Entity_Name – This attribute holds the entity name. Synonyms – Synonym words of each entity have been defined in this attribute. PK – This attribute contains primary key of a table. FK – It holds foreign key of a table. CK – A table may contain candidate key that can be used to identify records uniquely from table/tables. Attributes table – Attributes table contains attributes according to the table name. The attributes of this proposed attributes table has been given below.
Entity_Name – This attribute contains the entity name. Attribute – It holds attributes name according to the entity name. Synonyms – Synonym words of each attributes name have been defined in this attribute. PK – This attribute contains primary key of a table. FK – It holds foreign key of a table. CK – A table may contain candidate key that can be used to identify records uniquely from table/tables.
These two tables of synonyms database have been used in step 6 of algorithm in Section 3.
Database of E-entertainment information services contains many tables that are related with each other by the primary key, foreign key and candidate key. This database contains movie, theatre, movie time related knowledge data. The implementation idea of this database has been taken from “bookmyshow” (
Database of E-health information services
The database of E-health information contains Employee, Department, Doctor, Nurse, Patient, MRecords tables. The relationship between the tables has been done using primary key-foreign key concept. This knowledge database has been implemented and implementation structure is similar to E-University knowledge database.
Analysis of proposed system
Analysis of sub-string formation
The proposed algorithm contains an important step that is substring formation. The proposed step eliminates matched tokens after POS tagging and creates substring using unique or unmatched tokens. Each substring has been used in creation of combination(s). If number of combinations are 2, 4, 8, 10, 20
Let, there are n numbers of tokens in a query (Fig. 5). (Here
Analysis of sub-string formation: Token array.
A token may be matched or unmatched according to our System.
A token be matched symbolizes as
Therefore, Probability of matched tokens (
Let,
Where,
Therefore,
and thus the probability that a streak of
Union bound or Boole’s inequality refers to the probability of a union of events that is at most sum of individual events probabilities as in [28].
Now use inequality Eq. (4.1) to bind the length of the longest streak.
For,
Therefore,
According to
Figure 6 shows the partition of the token array.
Analysis of sub-string formation: Token array partition.
Two partitions have been done on length of a string that is started from 0 to
The first partition is from 0 to
Therefore,
The probability for first partition that is 0 to (
Therefore,
Here
Example: NL Query: “What are the course names offered by the Andhra University?”
In this Query, unmatched tokens are: “the”, “course”, “names”, “offered”, “the”, “Andhra”, “University”.
Therefore, the number of unmatched tokens
Since, the base of the logarithm is 2 that has been taken into the account and it is already shown in this section that
Therefore, “expected longest streak of unmatched tokens” of the above query will be
The algorithmic steps have been defined in Section 3 of the proposed system. The time complexity has been calculated after representing the operation of the proposed system with a function
Let, the time taken for NLP query reads and validation step is Let, the time taken for POS tagging Let, there are
Time complexity: string array partition.
The total number of substring The total number of spaces Here, the time complexity Let, in SS1 there are For 1 If the 1 Here, time taken If the 1 Let, The default database is consisted of So, there will be pq number of cells where the 1 Therefore, time taken Similarly, in Thus, Total time taken for Let, the time taken for Create SQL and Generate Response steps be

Therefore, Total time complexity:
“
Time taken by the algorithm –
Query string: “what is the course fee of B.A (English) offered by the Andhra University?”
No. of unique tokens
Time units may be in nano second,
Example of the time complexity: The time complexity of ““what is the course fee of MCA of the University of Burdwan?” will be
“Course fee”, “MCA”, “the University of Burdwan”
Time units may be in nano second,
Time complexity of an algorithm refers to the total time taken by the algorithm from run to completion of a specific task. The big
Aggregate Method: The aggregate method calculates the amortized cost of each operation. A sequence of n operations takes in worst-case time
Process name and process numbers (EAKPS)
Process name and process numbers (EAKPS)
HashTable_Insert (P, x)
If P.size == 0
Allocate P.table with 1 slot
P.size = 1
if P.num == P.size
allocate new-table with 2 *
P.size slots
insert all items in P.table into
new-table
free P.table
P.table = new-table
P.size = 2 * P.size
insert x into P.table
P.num= P.num+ 1
The analysis of sequence of n HashTable_Insert operations has been done on initially empty table. The current table has space for new item to be inserted or if it is the first operation then
The Cost of the
The total cost of
As per the above equation, the total cost of
Considering the above HashTable_insert function and equations let,
P.num
P.size
i. Amortized Cost of Extended Knowledge Provider System(EAKPS):
Process name and Process numbers have been given in Table 3.
Amortized Cost calculation:
Where,
where,
Each process of EAKPS will take 2.5 time units (Time units may be in nanosecond,
Parser plays an important role in natural language processing. It reads natural language sentences as an input and does the syntactic analysis of each sentence. Parser basically uses the Parts-of-Speech (POS) tagging. POS is grammatical categories of words like noun, pronoun, verb, adjective, adverb, preposition, etc. Many NLP based systems have been developed to extract grammatical formation of a sentence. Parsing can produce the grammatical formation of a sentence but extraction of semantic words from this grammatical formation is another challenge. The semantic word(s) may be a single noun, noun phrase, adjective, adverb, verb or verb phrase. Simple parsing can extract noun, pronoun, adjective, adverb, verb, etc. individually but simple parsing is not able to extract noun phrase or verb phrase. Noun phrase and verb phrase constructs the sentence and determiner, adjective, adverb may be initialized inside these phrases. Deep parsing works on identification and extraction of noun phrases and verb phrase from the sentences. An example has been given below
“What are the course names of The University of Burdwan?”
The simple parsing will return the grammatical formation of this sentence. The grammatical formation of this sentence will be “Wh
But this formation is not enough for the extraction of semantic word(s) from this sentence. “Course names” and “The University of Burdwan” are noun phrases in this sentence and assuming these phrases are needed for further process, the noun phrases should be extracted from the sentence. The single noun or noun phrases have been extracted to be utilized in generation of SQL query.
Basically three types of problems have been raised in this research work during the extraction of phrases.
Determiner problem
Extraction of noun phrase is a difficult work. Determiner may be present in a noun phrase or phrases. Identification of determiner from a noun or noun phrase is a challenging work. The example has been given below that
“What is the course fee of MCA of the University of Burdwan?”
Simple parsing will produce the grammatical form =- “Wh
The requirement of noun or noun phrases are
“Course fee”, “MCA” and “the University of Burdwan”.
In the given query, “course fee” is required to be extracted as “the course fee” phrase is not required whereas, “the University of Burdwan” is required to be extracted instead of “University of Burdwan” The former requires the determiner “the” to be dropped whereas the latter requires the determiner “the” to be retained. It is very complicated that which determiner should be removed from noun or noun phrases and which should be kept or retained.
The grammatical form has been given below
“Course fee” – Noun
“MCA” is a single noun and it can be detected after simple parsing but “course fee” and “the University of Burdwan” are noun phrases and it cannot be detected after simple parsing. The combination method has been applied to detect single noun or noun phrases which is stated in step 4 of Algorithm in Section 3. The query “What is the course fee of MCA of the University of Burdwan?” will be divided in substring after completing step 1 and step 2 of Algorithm in Section 3. The combination method will be applied on each substring in step 4 of Algorithm in Section 3. The system will remove repetition of combination wherever applicable in step 5 after creating all combinations of substring. The proposed system will create the possible semantic words from created combinations of each substring in step 6 of Algorithm in Section 3. The semantic words have been utilized to create the SQL in step 7. The SQL has been given in Fig. 8.
SQL statement.
The Fig. 8 shows that “The University of Burdwan” is initialized in “where” condition with “University_Name” attribute. The “The University of Burdwan” is noun phrase which is combination of nouns, determiner and preposition. The formation of this noun phrase is “Determiner
Result of the query.
Extraction of noun phrases with preposition is another problem in natural language query. Sometimes the preposition may be required in noun phrase or maybe not. Assume the natural language query is – “what is the duration of MCA of the University of Burdwan”. The requirement of noun or noun phrases from this natural language sentence to generate the SQL as per database has been given below.
“Duration”, “MCA” and “the University of Burdwan”
The “of” preposition has appeared three times in this example sentence. Three noun phrases can be extracted through “of” preposition from this example sentence. The possible noun phrases are
“Duration of MCA” “MCA of the University of Burdwan” “The University of Burdwan” “Duration of MCA of the University of Burdwan” and other combinations.
“Duration” is an attribute of Course relation where “MCA” and “the University of Burdwan” are values in default database. The formation of SQL as per default database will be – “SELECT Course_Names, Duration, University_Name FROM Course, Universities WHERE Universities.U_Id
“Duration” and “MCA” will be extracted as single noun but “the University of Burdwan” will be extracted as noun phrase. “duration of MCA”, “MCA of the University of Burdwan” and “duration of MCA of the University of Burdwan” will be rejected by the proposed algorithm. Noun and noun phrases will be extracted in step 6 and the SQL will be generated in step 7 of the proposed algorithm in Section 3.
The SQL has been given in Fig. 10 and response has been given in Fig. 11.
SQL formation.
Response of natural language query.
SQL formation.
Another “of” preposition problem has been discussed here. Assume, the same natural language query “what is the duration of MCA of the University of Burdwan” has been given by the user and university name “the University of Burdwan” has been replaced to “Burdwan University” in default database then formation of SQL is hard. But, this proposed system will recognize the noun phrase “the University of Burdwan” to “Burdwan University” noun phrase. Because “the University of Burdwan” and “Burdwan University” both combinations will be produced by the proposed algorithm. Creation of combinations has been discussed in step 4 of Section 3. The repetition of each combination(s) will be checked in step 5 of Section 3. “Burdwan University” is a value of default database that will be identified by value search method in step 6. The identified combination as a value will be initialized in semantic table. The formation of SQL from semantic table and Response will be generated in step 7 and step 8 of Section 3. Figure 12 shows the SQL format ion and Fig. 13 shows the response of NL Query. The proposed system is able to generate all possible combinations of noun and noun phrases from the natural language query. This system can handle more complex type natural language queries of assertive, interrogative or any other type.
Response of the query.
Conjunction plays a vital role in natural language query. The example of natural language query “what is the duration of MCA of the University of Burdwan” asks about the duration of MCA course under the university of Burdwan but assume if the query is “what is the course fee and duration of MCA of the Burdwan University” or “What is the affiliation, duration and course fee of MCA of the Burdwan University,” then requirement of data is not single type. The response should be constructed with multiple data as per requirement ofuser.
Query1: “what is the course fee and duration of MCA of the Burdwan University?”
Query 2: “What is the affiliation, duration and course fee of MCA of the Burdwan University?”
The first query shows two requirements – “course fee and duration” but second query shows three requirements – “affiliation”, ”duration” and “course fee”. The “and” conjunction increases the requirement of data in natural language query. The “and “conjunction has been removed from the POS table because it has been utilized in making of many combinations in step 4 of proposed algorithm. The proposed system is able to handle these type of assertive, interrogative or any other type natural language queries where the “and” conjunction is present. Figure 14 shows the SQL of “what is the course fee and duration of MCA of the Burdwan University”.
SQL formation.
The response has been generated on above SQL which is given below in Fig. 15.
Response of the query.
The preposition and Conjunction can be attached with a noun phrase. Assume, the natural language query is – “what is the duration of MCA of the Durgapur Institute of Advanced Technology and Management”. The noun phrase “the Durgapur Institute of Advanced Technology and Management” contains adjective, preposition and conjunction. The grammatical formation of this noun phrase is – “Determiner + Noun + Noun + Preposition + Adjective + Noun + Conjunction + Noun”. The proposed system is able to extract the entire noun phrase from the sentence as well as it can differentiate other conjunctions and prepositions in the same phrase like “and” conjunction and “of” preposition in the noun phrase of the given natural language query.
The Parts of speech list contains WH words, Auxiliary Verbs, Determiners, Pronouns, Prepositions and Conjunctions. These are initialized in string array (WH [ ], AUX [ ], DET [ ], PRO [ ], PRE [ ], CON [ ]). The parts of speech list does not contain “and”, “of” and “the”. Because they play an important role in constituting the noun phrases. The preposition, conjunction and determiner have been utilized in making combinations to extract noun or noun phrase(s) by the proposed algorithm.
The EAKPS is able to handle single word based natural language query or a query that contain multiple words. The examples have been stated on single word based natural language query and multiple word based natural language query.
SQL formation.
Response of the query.
Usermay enter single word as a query like “universities”, “courses”, “Andhra University” or “MCA”. A single word query “universities” has been taken as an example. The EAKPS will read this “universities” single word query and start to check validation in step 1 of algorithm in Section 3. The single word should be unique and if it is unique, then EAKPS will go for POS tagging and Substring format ion steps of algorithm in Section 3. The unique word will be a single string (word) after POS tagging and Substring formation steps. A single combination will be created after Create Combination step of algorithm in Section 3. The combination will be – “universities”. The step 5 of algorithm in Section 3 will not be applicable because number of combination is 1. The selected combination (word) from combination table will be initialized in semantic table using three searching methods that have been discussed in step 6 of algorithm in Section 3. The SQL and Response of NL query will be generated from this semantic table that has been discussed in step 7 and step 8 of Algorithm in Section 3. Figure 16 shows the formation of SQL format ion and Fig. 17 shows Response of single word natural language query – “universities”.
Multiple word based natural language query
The multiple words based natural language queries on E-University information may be
“Burdwan University”. “The University of Kalyani”. “Show me the course fee of MCA of Burdwan University”. “Please show me the course names and course fee of Pondicherry University and Burdwan University from 1995 to 2014”.
SQL formation.
Response of the Query.
SQL formation.
The EAKPS can handle natural language query example i, ii, iii and discussion has been done in Noun(s) or Noun Phrase(s) Extraction section. The natural language query “Please show me the course names and course fee of Pondicherry University and Burdwan University from 1995 to 2014” is different type where number of university name is two and number of year is two. EAKPS will take the following steps to generate the response from this “Please show me the course names and course fee of Pondicherry University and Burdwan University from 1995 to 2014” query. Each step has been stated in algorithm in Section 3. The EAKPS will read the natural language query and start to check validation. After validation, The EAKPS will do the POS tagging to remove words from query sentence that are already initialized in POS table. Remaining words of query sentence are unique words. The EAKPS will do substring formation on unique words in step 3 of algorithm in Section 3. After step 3, The EAKPS will create combination(s) in step 4 of algorithm in Section 3. Again, the EAKPS will check repetition of combination in step 5 of algorithm in Section 3. The Semantic table will be generated after completion of step 6 of algorithm in Section 3. The detail of generation of semantic table has been stated in step 6 of algorithm in Section 3. Figure 18 shows the generated SQL from natural language query after completion of step 7 in Section 3. Figure 19 shows response after completion of the step 8 of algorithm in Section 3.
Response of the query.
SQL formation.
Response of the query.
The multiple words based natural language queries on E-entertainment information may be
What is the total capacity of Eyelex at Asansol? Could you tell me the movie name and actors at Asansol Eyelex? I want to know the movie name of Alia Bhat as well as the theater name? I want to know theater address and price of tickets of a movie of Alia Bhat?
Responses of query no. (iv) has been given in Figs 20 and 21.
The multiple words based natural language queries on E-health information maybe
What are the doctor names of Neonatology?
Tell me the patient name under Gastroenterology department.
Show me the allotment and salary with nurse name.
Show me the name of patient with medical report and theirnameofdoctoradmittedtill date.
Responses of query no. (iv) has been given in Figs 22 and 23.
The EAKPS can handle a wide range of natural language queries like assertive, interrogative, imperative, compound and complex type. But it has some limitations to handle time related queries and computational queries. Time related queries are using time stamp. Computational queries are related to the aggregation (Maximum, Minimum, Count, Average, Sum). The user may enter a natural language query to know about the past data, present data or expected data for future. These are time related queries and temporal database can be the best option to perform thesequeries.
Timerelated Queries:
“I want know about the course name of Burdwan University available last five year.” “How many courses have been running under Andhra University since 2010?” “Who was the Department Head of B.A (English) and B.A (History) of Andhra University in 2012?”
Computational Queries:
“How many courses are offered by the Burdwan University?” “How many departments have been opened by the AndhraUniversity?” “Please tell me the total number of coursesis offered by the National Institute of Technology.”
The EAKPS cannot generate response against these time related or computational type natural language queries. The Algorithm has to be modified to handle this type of queries. The entity, attribute and value matching is using linear searching method. This process will take time. This linear searching method can be modified to reduce the time complexity. A NoSQL database can be considered as a default database to enhance EAKPS. The JSON script creation from natural language query may be an extra advantage of EAKPS. In this section, the above discussion can be considered as a future work of EAKPS.
EAKPS is able to handle wide range of natural language queries. The EAKPS is flexible to generate responses from assertive, interrogative, imperative, compound and complex type queries. The natural language can be of any type. The noun extraction process from natural language query is complex. In this paper, the algorithm has been designed to extract noun or noun phrases from natural language queries. This will create combinations from extracted noun or noun phrases. After preparing the semantic table, the proposed system will generate right combinations for creating the semantic table that helps to create SQL and generates response to the client side. The noun or noun phrase extraction from query sentence is a deep parsing concept that has been implemented in EAKPS. Unlike the systems proposed in literature, this system does not have dependency on keywords. The EAKPS is eligible to generate the response against natural language queries from the user. Our proposed EAKPS can be adopted in many knowledge management systems to enhance their performance on handling natural language queries.
Footnotes
Acknowledgments
This research work has been done at Research Project Lab of National Institute of Technology (NIT), Durgapur, under Visvesvaraya PhD Scheme. Financial support was received from Visvesvaraya PhD Scheme, Deity, Govt. of India (Order Number: PHDM LA/4(29)/2014–2015 Dated- 27/ 4/2015) to carry out this research work. The Authors would like to thank Dept. of Information Technology, NIT, Durgapur, India for academically support to this research work.
