A novel classification framework for the Thirukkural for building an efficient search system

Abstract

The Thirukkural, a classic of Tamil literature written in 300 BCE, is a didactic work. Although it comprises 1330 couplets organized into three sections and 133 chapters, retrieving meaningful couplets for a given query in a search system requires a better organization. This paper lays such a foundation by classifying the Thirukkural into ten new categories, called superclasses, that are helpful for building a better Information Retrieval (IR) system. The classifier is trained using the Multinomial Naïve Bayes algorithm. Each superclass is further classified into two subcategories based on the didactic information. The proposed classification framework is evaluated using the precision, recall and F-score metrics and achieves an overall F-score of 82.33%; it is also compared with the Support Vector Machine, Logistic Regression and Random Forest algorithms. An IR system is built on top of the proposed framework, and its performance is compared with Google search and a locally built keyword search. The IR system built on the proposed classification framework achieves a mean average precision score of 89%, whereas Google search and keyword search yield 59% and 68% respectively.

Keywords

Natural language processing, text classification, information retrieval, multinomial naive Bayes classifier, morphological analysis

1 Introduction

The Tamil language and literature have a rich and long tradition, going back a little more than two thousand years. The Thirukkural, Agananooru, Purananooru and Silapathikaram are a few world-famous works. The Thirukkural, one of the most widely translated books of all time, has been published in a number of languages worldwide. Tamil literature can broadly be classified into four eras: Sangam literature, post-Sangam literature, medieval age literature and modern literature. The Sangam age, long reckoned the golden era of Tamil literature, dates back to the 1st century BC. The Thirukkural was written in the post-Sangam age. Over such a long and glorious history, Tamil has transformed in both speech and text. The current generation of Tamil speakers is largely unaware of the rich nuances characteristic of the lexical range used in ancient Tamil literature. This paper proposes a Thirukkural classification framework as an initial attempt to take ancient Tamil literature to today’s millennials.

Thirukkural focuses on morality and ethics. It offers all possible solutions for living a successful and happy life that are appropriate for any generation. It has been translated into 37 different languages around the world, necessitating computational processing of the Thirukkural.

The Thirukkural has 1330 couplets divided into 133 chapters, focusing on three major aspects of life: virtue, wealth and love [7]. The couplets in all three sections offer advice to various classes of human society on what to do and what not to do. This paper attempts to merge the existing classification of the 3 aspects and 133 chapters into ten superclasses that denote the major classes of human society. The ten superclasses identified by the proposed framework are king, saint, scholar, friend, minister, family man, common man, husband, wife, and general. The superclasses are further classified into two subclasses: To Do (TD) and Not To Do (NTD). These specify the advice to be followed, or not, by the respective superclasses. The proposed reorganization of the Thirukkural is done to make computational classification easier. For instance, if an Information Retrieval (IR) system is built on top of the proposed classification framework and a query such as “What are the factors a king must consider while making a decision for his people?” is given, the proposed framework offers semantically closer results, given that all Thirukkural couplets related to king are grouped into a superclass and two subclasses.

The contributions of this paper are threefold:

Identification of 10 superclasses, and their respective two subclasses, for the Thirukkural.

Classification of the Thirukkural into superclasses using the Multinomial Naïve Bayes (MNB) Classifier.

Rule-based classification of subclasses.

The MNB classifier is a multiclass classification algorithm that is well suited to small datasets. The Thirukkural has only 1330 couplets, and these couplets need to be classified into ten superclasses in the proposed work. Hence, the proposed work uses the MNB classifier to classify the Thirukkural couplets.

The remainder of the paper is laid out as follows. The background information about the Thirukkural is presented in Section 2. The related work is described in Section 3, and the proposed work is discussed in Section 4. Section 5 explains the experiments and results of the proposed methodology. Section 6 presents conclusions and future work.

2 Background

This section describes the Tamil language and the Thirukkural. Tamil is a language that has been spoken since 300 BC. The language has undergone a radical transformation in terms of the script, grammar and word usage, and these changes have been reflected, correspondingly, in its literature. For the current generation that is unable to understand such ancient literature but is interested in reading it, simplified versions are readily available. To make it even more convenient, if those simplified versions are made available in an IR system like Google, alongside the original classic, a lot of people today will benefit from an understanding of the ancient language, its style, and its indubitable value. The proposed work intends to lay such a foundation.

The Thirukkural is a classic work of Tamil literature which focuses on ethics and moral values. It has been translated into more than 37 Indian and international languages, including English, Latin, French, German, Russian, Spanish, Chinese, Japanese and Arabic. It comprises 1330 couplets organized into 133 chapters and divided into three sections: “Arattuppāl” (virtue), “Porutpāl” (wealth) and “Kāmattuppāl” (love). Thirty-eight chapters are devoted to virtue, 70 to wealth, and 25 to love [7]. Each chapter has 10 couplets. Each couplet contains seven words known as “Cīr”, with four cīrs in the first line and three in the second. A cīr is a crisp way of representing many words using one word.

The three sections offer all classes of human society invaluable advice. The proposed work aims at reorganizing the Thirukkural by classifying all 1330 couplets from the perspective of the classes of human society that are identified as superclasses. Table 1 shows the sections of the existing organization of the Thirukkural and the superclasses that each section addresses. For the Tamil words used, the English transliteration and meaning are given within parentheses. The transliterations are produced using Google Translate. G. U. Pope’s explanation is used for all the Thirukkural couplets given in the examples.

Table 1
Sections of Thirukkural and its superclasses

Sl. No.	Sections	Superclasses
1	(Arattuppāl-virtue)	saint, common man, scholar, family man
2	(Porutpāl-wealth)	king, scholar, minister, common man, friend, family man
3	(Kāmattuppāl-love)	husband, wife, family man

It can be observed that the superclasses are scattered across all three sections. When Natural Language Processing (NLP) applications, such as an IR or a summary generation system, need to retrieve the Thirukkural couplets meant for a particular superclass, the couplets have to be grouped by class so that these applications can access them easily. Table 1 shows only nine superclasses. The proposed work groups the Thirukkural couplets that do not fall into these nine classes into a tenth superclass, titled general.

All advice in the Thirukkural can, generally speaking, be classified into what a person must, and must not, do. Further, the proposed framework has incorporated this classification by introducing two subclasses: TD and NTD for each superclass. Example 1 in Fig. 1 belongs to the superclass king and the subclass NTD.

Fig. 1

Example 1.

The Thirukkural couplet in Example 1 advises a king with a sizeable army not to engage in battle with a king who has a minuscule army. If he does, his pride will take a beating on account of his being made to look small. When a query is given to the IR system, the proposed work can aid in retrieving this Thirukkural couplet.

3 Related work

In this section, a survey of the literature on the related work is undertaken from two perspectives. Since the proposed work is a text classification framework, the related work on text classification is discussed. Given that the proposed framework classifies Tamil literature for a computational analysis later, this section focuses on computational work done on the literature.

3.1 Works related to text classification

In [9], Nunzio (2009) used the Bernoulli and MNB models for text classification into two classes. The term frequencies of the document were used as features. The work was tested on 94380 English documents from four datasets: Reuters, the 20 Newsgroups corpus, WebKB and the 7 Sectors database. Mean average precision was used as the evaluation metric, and a score of 27% was achieved.

In [8], Capdevila and Florez (2009) proposed a framework for automatic text categorization using the Gaussian probabilistic classifier and classified text into 90 categories, with the unique words in the document functioning as features. Their work was tested on 10788 English documents from two datasets, 20 Newsgroups and the Reuters 21578 collection, with accuracy as evaluation metric.

In [15], Rajan et al. (2009) classified Tamil documents into five categories using the vector space model and artificial neural networks, with the most frequent words in the documents used as features. Their work was tested on 400 Tamil documents of the Tamil CIIL corpus with precision as the evaluation metric.

In [2], Al-Salemi and Aziz (2011) used the simple Naïve Bayes, multi-variant Bernoulli Naïve Bayes and MNB models to automatically categorize Arabic documents into four classes, with the most frequent terms in the documents serving as features. An in-house collection of 3172 Arabic news documents was used as the dataset to test their work. Precision, recall, F1-measure and macro-average (the average of the F1-measures of all categories) were used as evaluation metrics, and an overall Macro-F1 of 0.941 was achieved.

In [16], Rizzo et al. (2017) classified research articles into two classes, relevant and non-relevant papers, using the MNB classifier. The classification helps researchers retrieve relevant papers for their research. Authors’ names, names of journals, journal references, abstracts, introductions and conclusions were used as features. Their work was tested on 2215 papers from the benchmark Systematic Literature Review dataset, with recall as the evaluation metric, and achieved 95%.

In [1], Al-Badarneh et al. (2017) investigated different indexing approaches for Arabic text classification using the MNB classifier and identified five categories. Word frequencies were used as features in their dataset of 1000 normalized Arabic documents. Micro-average accuracy was used as the evaluation metric, and a score of 99.36% was achieved.

In [23], Xu (2018) classified a text into 20 categories, comparing algorithms such as the classical Naïve Bayes classifier as well as the Bernoulli and Gaussian event models. Word frequencies of the document were used as features. Their work was tested on 23020 English documents from the 20 Newsgroups and WebKB datasets. Precision, recall and F-measure were used as evaluation metrics.

In [6], Bahgat et al. (2018) classified emails into two categories using semantic methods. The Principal Component Analysis and Correlation Feature Selection techniques were used for feature selection. The benchmark Enron dataset was used for evaluating their work. A comparative study was performed with different machine learning techniques, and 90% accuracy was achieved.

In [12], Luo (2021) used the Naïve Bayes, Support Vector Machine and Logistic Regression machine learning algorithms to classify English documents. The word frequency, question mark, full stop, initial word and final word of the documents were used as features. The work was tested on 1033 English documents, with precision, recall and F-measure as evaluation metrics.

The existing text classification work has been done on English, Arabic and Tamil expository documents. The proposed work differs from these by attempting a text classification framework for a literary text in Tamil, and is the first of its kind. The usage of words in expository documents is different from that in literary text. For instance, the literary word “Ilavē” (not) is not commonly used in expository Tamil text. It is one of the words in the negative feature set that is used for subclass classification. Handling such literary words is one of the challenges of the proposed work. Furthermore, the proposed work attempts to reorganize the ancient Tamil literary work, the Thirukkural, to make computational analysis easier. The next section describes other computational work done on Tamil literature.

3.2 Computational work done on Tamil literature

In [10], Elanchezhiyan et al. (2011) proposed the Kuralagam search engine for the Thirukkural, which retrieves couplets based on keywords, concepts and expanded query words. It retrieves couplets that are conceptually relevant to the query. Mean average precision was used as the evaluation metric, and a score of 0.83 was achieved.

In [13], Madhavan et al. (2012) classified Tamil poems into four protocols called “Paa”, using a rule-based approach in which the rules were created with a context-free grammar. The Tamil poems were parsed, an intermediate representation was created, and the poems were subsequently classified into four categories. A classification accuracy of 90% was achieved.

In [17], Sridevi and Subashini (2013) classified 11th century Tamil handwritten texts using a probabilistic neural network. Line, word and character segmentation and feature extraction were done before the classification, with structural and syntactic features. Testing involved the use of 500 characters, with accuracy as the evaluation metric. A classification accuracy of 80.52% was achieved.

In [21], Subalalitha and Ranjani (2014) used a concept called Suthras, found in Sanskrit literature as well as in the Tamil grammar text, Nannool, for a crisp representation of texts. They attempted to merge these concepts with current text processing techniques like the rhetorical structure theory and universal networking language to identify the semantic indices of Tamil documents. The most frequently occurring words and synonyms were used as features. Their dataset comprised 1000 tourism-based Tamil documents. The work was evaluated using the mean average precision metric and achieved a score of 0.7.

In [20], Subalalitha and Poovammal (2018) constructed an automatic bilingual dictionary for the Thirukkural using the Naïve Bayes machine learning algorithm. They used an English translation and commentary by G. U. Pope, alongside explanations in Tamil by Dr M. Varadharajan and Dr Solomon Pappaiya. Precision was used as the evaluation metric, and 70% was achieved.

In [4], Anita and Subalalitha (2019) proposed an approach to cluster Thirukkural couplets using discourse connectives as features. The K-means clustering machine learning algorithm was used. Cluster purity, the Rand index, precision, recall and F-score were used as evaluation metrics to obtain 79% purity, 92% overall Rand index, 79% precision, 80% recall and an F-score of 79%.

In [5], Anita and Subalalitha (2019) proposed a rule-based approach to construct a discourse parser for the Thirukkural, using discourse connectives as the features. Precision and recall were used as evaluation metrics, and the parser achieved 81.5% precision and 81.86% recall.

In [19], Subalalitha (2019) proposed an information extraction scheme for the Tamil literary work, Kurunthogai. Details pertaining to food, flora, fauna, vessels, water bodies, noun unigrams, verb unigrams, adjective-noun bigrams and adverb-verb bigrams were extracted. A Tamil morphological analyzer tool was used to extract N-grams. Precision was used as the evaluation metric, and 88.8% was obtained.

It can be seen that substantial computational work has been done on Tamil literature. The proposed method differs from the above by attempting a yet-to-be-explored text classification of the Thirukkural.

4 Proposed work

The proposed system tries to combine the current classification of the three sections and 133 chapters into ten superclasses that represent the major human groups. King, saint, scholar, friend, minister, family man, common man, husband, wife, and general are the ten superclasses defined by the proposed work. To Do (TD) and Not To Do (NTD) are the two subclasses of the superclasses. These specify aspects/advice on what to do and what not to do, by the respective superclasses.

The architecture of the proposed work is shown in Fig. 2. The 1330 Thirukkural couplets are divided into a training set (80% of all couplets) and a testing set (20%). The training couplets are tagged into ten superclasses by semantically analyzing them. The testing couplets are classified into superclasses using MNB Classifier. Each superclass is further classified into two subclasses using morphological features [3] generated by the Morphological Analyzer, which is a tool developed at Anna University, Chennai.

Fig. 2

Architecture for the proposed methodology.
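The 80/20 division described above can be sketched as follows. This is only an illustration: the paper does not say how couplets were assigned to each set, so the random shuffle, the seed and the function name `split_couplets` are assumptions made here.

```python
import random

def split_couplets(couplets, train_fraction=0.8, seed=0):
    """Shuffle the couplets and split them into training and testing sets.

    Random assignment and the seed are assumptions; the paper only
    reports the 80%/20% proportions.
    """
    shuffled = list(couplets)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Couplet numbers 1..1330 stand in for the couplet texts.
train_set, test_set = split_couplets(range(1, 1331))
print(len(train_set), len(test_set))  # 1064 266
```

With 1330 couplets this yields the 1064 training and 266 testing couplets used in the rest of the paper.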

4.1 Superclass classification

The MNB classifier is one of the algorithms for multi-class text classification. Since the MNB is a supervised classification algorithm, the text documents are labeled / categorized into predefined classes. The MNB builds a probabilistic model using the training set. The testing set is classified using the probabilistic model.

Let Tr = {Tr_1, Tr_2, ..., Tr_m} be the training set, where m = 1064, and let Ts = {Ts_1, Ts_2, ..., Ts_n} be the testing set, where n = 266. The Thirukkural couplets in the training set are labeled manually. The superclasses king, saint, scholar, friend, minister, family man, common man, husband, wife, and general are denoted as C = {C_1, C_2, ..., C_k}, where k = 10. During training, each couplet in Tr is mapped semantically to one of the classes of C, Tr ⟶ {C_1, C_2, ..., C_k}.

The construction of the probabilistic model involves the following steps.

Calculation of prior probabilities.

Calculation of conditional probabilities.

Calculation of posterior probabilities.

The prior probabilities are calculated for each superclass from the training set, Tr, using Equation (1). $P(C_{i}) = \frac{NC_{i}}{m}$ (1) where i = 1, 2, ..., k, NC_i represents the number of Thirukkural couplets in Tr that belong to C_i, and m represents the total number of couplets in Tr. For instance, P(king) is calculated as the number of couplets classified into king, divided by the total number of couplets. The prior probability is determined for all the ten superclasses.

After ascertaining the prior probability, the conditional probability is calculated. It is the probability of a word appearing in a particular superclass, C_i, and is given by Equation (2). Let Tr_j = w_1, w_2, ..., w_x, where x = 7. $P(w \mid C_{i}) = \frac{n_{w, C_{i}} + 1}{n_{C_{i}} + |U|}$ (2) where n_{w,C_i} is the total number of occurrences of a word, w, in couplets belonging to class C_i, n_{C_i} is the total number of words in the couplets of class C_i, and |U| is the number of unique words in Tr. The conditional probabilities are calculated for each word in the couplets of the training and testing sets for all the ten superclasses. A word may have a higher conditional probability for one superclass than for the others. For example, the word “Vēntar” (king) has a higher conditional probability for the superclass king than for the other classes.

The posterior probability is calculated for the couplets in the testing set by Equation (3). Let Ts_y = w_1, w_2, ..., w_x, where x = 7. $P(C_{i} \mid Ts_{y}) = P(C_{i}) \prod_{j = 1}^{x} P(w_{j} \mid C_{i})$ (3) $Class(Ts_{y}) = \arg\max_{C_{i}} P(C_{i} \mid Ts_{y})$ (4)

The couplets in the testing set are classified by Equation (4). Example 2 in Fig. 3 is categorized as superclass king, since this couplet contains terms such as “Ceṅkōl” (scepter), “Vēntar” (king) and “Kotai” (donation), which have a higher probability for the superclass king than for the other classes.

Fig. 3

Example 2.
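The three steps of the probabilistic model (Equations (1) to (4)) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy training data and its token strings are placeholders introduced here, and the posterior is computed as a sum of logarithms, which selects the same class as the product in Equation (3) while avoiding numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_mnb(couplets, labels):
    """Estimate priors (Eq. 1) and the counts needed for Eq. 2."""
    m = len(couplets)
    class_counts = Counter(labels)        # NC_i
    word_counts = defaultdict(Counter)    # n_{w,C_i}
    vocab = set()                         # U
    for words, label in zip(couplets, labels):
        word_counts[label].update(words)
        vocab.update(words)
    priors = {c: class_counts[c] / m for c in class_counts}
    totals = {c: sum(word_counts[c].values()) for c in class_counts}  # n_{C_i}
    return priors, word_counts, totals, len(vocab)

def classify(words, model):
    """Return the class with the highest posterior (Eqs. 3 and 4)."""
    priors, word_counts, totals, vocab_size = model

    def log_posterior(c):
        score = math.log(priors[c])
        for w in words:
            # Add-one smoothed conditional probability of Eq. 2.
            score += math.log((word_counts[c][w] + 1) / (totals[c] + vocab_size))
        return score

    return max(priors, key=log_posterior)

# Toy training data: placeholder tokens, not real couplets.
toy = [
    (["ventar", "cenkol", "patai"], "king"),
    (["ventar", "kotai", "aran"], "king"),
    (["il", "manai", "makkal"], "family man"),
]
model = train_mnb([w for w, _ in toy], [c for _, c in toy])
print(classify(["cenkol", "ventar", "il"], model))  # king
```

Because every unseen word still receives the smoothed probability 1 / (n_{C_i} + |U|), a test couplet containing words absent from a class does not drive that class's posterior to zero.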

4.2 Subclass identification

The test set has been classified into 10 superclasses. Each superclass, C, now needs to be further classified into two subclasses, TD and NTD. Since the Thirukkural explains what the members of each superclass must and must not do, the feature set is constructed from the negative words used in the Thirukkural. This feature set for subclass identification is built using a morphological analyzer [3], which identifies both the morphemes and the Parts of Speech tags. Morphological features such as adjectives, for example “Ketta” (bad), adverbs, for example “Inri” (without), and negative finite verbs, for example “Alla” (not), make up the feature set for subclass identification.

Algorithm 1 describes the subclass identification. The Thirukkural couplet is analyzed for negative words by searching the Feature_set. If no feature from the Feature_set occurs in the couplet, the couplet belongs to the positive subclass, TD. If a word W_i of the couplet matches a word in the Feature_set, the three words preceding W_i (W_(i-1), W_(i-2), W_(i-3)) and the three words succeeding W_i (W_(i+1), W_(i+2), W_(i+3)) must be analyzed for negative words; a window of three words on each side is used because each Thirukkural couplet contains 7 words. If a second negative word is found in this window, the couplet belongs to the positive subclass, TD (as in Example 5, where two negative words occur together); otherwise it belongs to the negative subclass, NTD.

Algorithm 1: Subclass Identification
Construct the feature set using the morphological analyzer, Feature_set = {f1, f2, ...}
for each Ts_y in Ts = {Ts_1, Ts_2, ..., Ts_n}, where Ts is the testing set
    if W_i in Ts_y ∈ Feature_set, where W_i is the i-th word of Thirukkural couplet y
        if (W_(i-1) ∈ Feature_set or W_(i-2) ∈ Feature_set or W_(i-3) ∈ Feature_set) or
           (W_(i+1) ∈ Feature_set or W_(i+2) ∈ Feature_set or W_(i+3) ∈ Feature_set)
            Assign Ts_y to TD subclass
        else
            Assign Ts_y to NTD subclass
    else
        Assign Ts_y to TD subclass
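One way to read Algorithm 1 as runnable code is sketched below. It assumes that each couplet is already split into morphemes or words by the morphological analyzer and that the decision is taken at the first negative word found; the function name `identify_subclass`, the toy feature set and the placeholder tokens are introduced here for illustration only.

```python
def identify_subclass(words, feature_set, window=3):
    """Return "TD" or "NTD" for one couplet, following Algorithm 1.

    `words` is the list of (already segmented) words of the couplet and
    `feature_set` is the set of negative words.
    """
    for i, word in enumerate(words):
        if word not in feature_set:
            continue
        # Up to `window` words before and after the matched negative word.
        neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if any(n in feature_set for n in neighbours):
            return "TD"    # a second negative word in the window
        return "NTD"       # a lone negative word
    return "TD"            # no negative word in the couplet

# Toy feature set: lower-cased transliterations used as placeholder tokens.
negatives = {"illai", "innata", "alla", "inri", "ketta"}

print(identify_subclass(["a", "b", "c", "d", "illai", "e", "f"], negatives))       # NTD
print(identify_subclass(["a", "innata", "illai", "d", "e", "f", "g"], negatives))  # TD
print(identify_subclass(["a", "b", "c", "d", "e", "f", "g"], negatives))           # TD
```

The quality of the result depends on the segmentation: if the analyzer fails to separate a negative morpheme from a compound word, the word never matches the feature set and the couplet falls through to TD.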

Example 3 in Fig. 4 illustrates how subclass identification is carried out. In Example 3, there is no negative word that is present in the Feature_set, so the couplet belongs to the TD subclass.

Fig. 4

Example 3.

In Example 4 in Fig. 5, the word “Illai” (no) is W₅ ∈ Feature_set and no other word from the Feature_set occurs among the surrounding words, so the couplet belongs to the NTD subclass.

Fig. 5

Example 4.

In Example 5 in Fig. 6, the word “innata” (unpleasant) is W₂ ∈ Feature_set and the word “Illai” (no) is W₃ ∈ Feature_set, so the couplet belongs to the TD subclass. The word “” contains the word “’ which is separated using the morphological analyzer tool.

Fig. 6

Example 5.

5 Experiments and results

5.1 Dataset details

The annotation of the dataset is validated using inter-rater reliability [11, 18], a test validity method that measures how consistently human experts assign scores. In our dataset, three human experts assigned a superclass category to each Thirukkural couplet.

Table 2 shows the percentage agreement for the superclass categories of the Thirukkural couplets. The numbers 0 to 9 represent the superclasses king, saint, scholar, friend, minister, family man, common man, husband, wife, and general respectively. All three annotators classified couplet 1 as 9, the general superclass, so the agreement is 100%. Inter-rater reliability is calculated similarly for all the Thirukkural couplets. The next section describes the implementation and analysis of the results.

Table 2
Percentage agreement across multiple annotators

Thirukkural Couplets	Annotator 1	Annotator 2	Annotator 3	% Agreement
couplet 1	9	9	9	100
couplet 2	2	2	2	100
couplet 3	6	6	9	66.66
...	...	...	...	...
...	...	...	...	...
...	...	...	...	...
couplet 1330	7	7	7	100
inter-rater reliability	89.44
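The per-couplet figures in Table 2 can be reproduced with a small sketch. The paper does not give the exact formula; the version below assumes that the percentage agreement is the share of annotators who chose the most common label, which matches the value reported for couplet 3 (two of three annotators, about 66.7%). The function name `percent_agreement` is introduced here.

```python
from collections import Counter

def percent_agreement(labels):
    """Share of annotators choosing the most common label, in percent.

    Assumed formula: it reproduces the rows shown in Table 2 but is not
    stated explicitly in the paper.
    """
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100 * most_common_count / len(labels)

# Couplets 1 to 3 from Table 2 (annotator 1, annotator 2, annotator 3).
rows = [(9, 9, 9), (2, 2, 2), (6, 6, 9)]
scores = [percent_agreement(r) for r in rows]
mean_agreement = sum(scores) / len(scores)
print([round(s, 2) for s in scores])  # [100.0, 100.0, 66.67]
```

Averaging the per-couplet scores over all 1330 couplets in the same way would give a single reliability figure; `mean_agreement` here covers only the three rows shown and is not comparable to the 89.44 reported in the table.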

5.2 Evaluation

The training set consists of 1064 Thirukkural couplets (80% of all couplets) and the testing set consists of 266 (20% of all couplets). The proposed approach has been evaluated using the metrics of precision, recall and F-measure, calculated as in Equations (5), (6) and (7) [14]. $Precision\ (P) = \frac{\text{Number of Thirukkural couplets correctly classified } (C)}{\text{Total number of Thirukkural couplets classified } (M)}$ (5) $Recall\ (R) = \frac{\text{Number of Thirukkural couplets correctly classified } (C)}{\text{Total number of relevant Thirukkural couplets to be classified } (N)}$ (6)

The F-score is given by the following formula. $F\text{-}Score = \frac{2PR}{P + R}$ (7)
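As a worked check of Equations (5) to (7), the sketch below recomputes the metrics for the king row of Table 3 (N = 24, M = 20, C = 18). The function name `prf` and its parameter names are introduced here for illustration.

```python
def prf(n_relevant, m_classified, c_correct):
    """Precision, recall and F-score from the counts N, M and C."""
    precision = c_correct / m_classified   # Eq. 5: C / M
    recall = c_correct / n_relevant        # Eq. 6: C / N
    f_score = 2 * precision * recall / (precision + recall)  # Eq. 7
    return precision, recall, f_score

# King row of Table 3.
p, r, f = prf(n_relevant=24, m_classified=20, c_correct=18)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")  # P=90.00% R=75.00% F=81.82%
```

The sketch assumes non-zero counts; a class with no classified or no correctly classified couplets would need a guard against division by zero.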

The value of M is calculated automatically, whereas the values of C and N are determined by human judgment. Three domain experts calculated these metrics, and the averages are presented in Tables 3 and 5.

Since the proposed work merges the three sections and 133 chapters into ten superclasses, it is seen from Table 3 that the superclass common man dominates the rest of the classes and the superclass saint is the least observed. Table 4 gives the number of chapters in which each superclass is addressed. Table 4 shows that the proposed framework retains the integrity of the contents even after the reorganization of the existing Thirukkural.

Table 3

Precision, Recall and F-Score for the Superclass classification

Superclasses	N	M	C	P %	R %	F
Common man	101	82	72	87.8	71.29	78.69
family man	51	79	50	63.29	98.04	76.92
king	24	20	18	90	75	81.82
scholar	10	8	7	87.5	70	77.78
friend	12	10	9	90	75	81.82
minister	8	6	5	83.33	62.5	71.43
saint	5	7	5	71.43	100	83.33
husband	11	10	9	90	81.82	85.72
wife	16	18	15	83.33	93.75	88.23
general	28	26	23	88.46	82.14	85.18

Table 4

Chapter-superclass mappings

Sl.No.	Super Classes	Number of Chapters
1	Common man	35
2	family man	25
3	king	22
4	scholar	21
5	friend	20
6	minister	18
7	saint	15
8	husband	16
9	wife	21
10	general	20

The prior probability and conditional probability are higher for the superclasses that appear in the largest number of chapters. This biases the classifier toward those superclasses and is the reason for the decrease in the precision, recall and F-score metrics. It can be alleviated by adding weights to the words that decide the superclasses. Since a couplet has only seven words, the proposed approach has considered all words to be equal.

The Thirukkural couplet in Example 6 in Fig. 7 should be classified as superclass king. The probabilistic model is based on the frequency of words. This couplet contains the word “Il” (house), which is related to the superclass family man, twice, and its other words also occur more often in the superclass family man than in the superclass king. Hence this Thirukkural couplet is classified as family man instead of king.

Fig. 7

Example 6.

The superclasses are classified into two subclasses: TD and NTD. The precision, recall and F-score values are found using Equations (5), (6) and (7), and listed in Table 5. The precision, recall and F-score metrics obtained for subclass identification depend on the correctness of the morphological analyzer. During training, certain words are not correctly split into morphemes.

Table 5

Precision, Recall and F-Score for the Subclass classification

Super classes	Sub classes	N	M	C	P %	R %	F
Common man	TD	54	54	47	87.04	87.04	87.04
	NTD	21	28	19	67.86	90.48	77.55
family man	TD	45	45	43	95.56	95.56	95.56
	NTD	35	34	30	88.24	85.71	86.96
king	TD	16	14	12	85.71	75	80
	NTD	6	5	4	80	66.67	72.73
scholar	TD	4	4	3	75	75	75
	NTD	6	4	4	100	66.67	80
friend	TD	7	7	5	71.43	71.43	71.43
	NTD	5	3	3	100	60	75
minister	TD	2	3	2	66.67	100	80
	NTD	3	3	2	66.67	66.67	66.67
saint	TD	10	7	6	85.71	60	70.59
	NTD	2	1	1	100	50	66.67
husband	TD	10	9	8	88.89	80	84.21
	NTD	1	1	1	100	100	100
wife	TD	9	13	8	61.54	88.89	72.73
	NTD	6	5	4	80	66.67	72.73
general	TD	14	16	14	87.5	100	93.33
	NTD	10	10	7	70	70	70

Example 7 in Fig. 8 should be classified as the NTD subclass. Some literary words are not correctly separated by the morphological analyzer. For instance, it did not split the word “Peyarār” (not change) from “Tāmpeyarār” (who do not change), so the couplet is classified as the TD subclass.

Fig. 8

Example 7.

The results of the machine learning algorithms MNB, Support Vector Machine (SVM), Logistic Regression (LR) and Random Forest (RF) are compared. The precision, recall and F-score values of the algorithms are shown in Figs. 9, 10 and 11.

Fig. 9

Comparison of algorithms using precision.

Fig. 10

Comparison of algorithms using Recall.

Fig. 11

Comparison of algorithms using F-score.

It can be observed from the comparison that the MNB algorithm gives better results than the rest of the algorithms. The reason is that MNB works well on small datasets and is based on the frequency of words.

An IR system has been developed on top of the proposed framework to assess its efficiency, and its performance has been compared to Google search and a locally built search that does not use the proposed classification framework (Search without classification). The IR system is evaluated using the precision (P), average precision (AP) and mean average precision (MAP) metrics computed using Equations (8), (9) and (10) [22]. $Precision\ (P) = \frac{\text{Number of Thirukkural couplets correctly retrieved}}{\text{Total number of Thirukkural couplets retrieved}}$ (8)

Equation (8) is used to calculate the precision of the retrieved Thirukkural couplets. Equation (9) yields AP@10, which is the average precision for the top 10 retrieved results. $AP@10 = \frac{1}{TP} \sum_{k = 1}^{10} P@k \times R@k$ (9) where TP denotes the number of relevant Thirukkural couplets retrieved; the sum runs over the top 10 results retrieved from the search system; P@k denotes the precision at the kth retrieved result; and R@k denotes the relevance of the kth result, which is 1 if the result is relevant and 0 otherwise. Equation (10) is used to calculate the MAP, which is the mean of the AP@10 values of all queries. $MAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}$ (10)

N is the number of queries, which in this case is 15. The proposed supervised classification-based search method has a MAP score of 0.89, compared to 0.59 for Google search and 0.68 for Search without classification.
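The AP@10 and MAP calculations can be illustrated with a short sketch. The ranked relevance list below is a made-up example, not one of the 15 queries, and the function names are introduced here; TP is taken to be the number of relevant couplets among the top 10 results.

```python
def average_precision_at_k(relevance, k=10):
    """AP@k for a ranked list of 0/1 relevance judgments (Eq. 9)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # P@k, counted only where R@k = 1
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ap_values):
    """MAP: the mean of the per-query AP values (Eq. 10)."""
    return sum(ap_values) / len(ap_values)

# Hypothetical judgments for the top 10 results of one query.
ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
ap = average_precision_at_k(ranked)
print(round(ap, 3))  # 0.747
```

In this example the relevant results sit at ranks 1, 3, 4 and 7, so AP@10 is (1/1 + 2/3 + 3/4 + 4/7) / 4; pushing relevant couplets toward the top of the list raises the score.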

The average precision values for Google search, Search without classification, and Classification-based search are shown in Table 6, which shows an output comparison of the proposed approach with the other two, where Qi denotes query i. Because of the use of superclass and subclass classification, the precision and MAP scores obtained by the proposed work are higher than those of Google search and Search without classification.

Table 6

Performance Comparison using average precision values

Queries	Google search	Search without classification	Classification-based search
Q1	0.69	0.73	0.86
Q2	0.55	0.6	0.91
Q3	0.7	0.77	0.85
Q4	0.44	0.53	0.98
Q5	0.6	0.7	0.93
Q6	0.58	0.71	0.95
Q7	0.46	0.53	0.96
Q8	0.73	0.8	0.94
Q9	0.6	0.71	0.89
Q10	0.55	0.72	0.84
Q11	0.7	0.77	0.81
Q12	0.57	0.65	0.86
Q13	0.58	0.72	0.91
Q14	0.52	0.55	0.91
Q15	0.53	0.67	0.82
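
As a quick consistency check, the MAP scores reported above can be recomputed from the per-query values in Table 6 by applying Equation 10 to each column. The sketch below is a minimal illustration; the dictionary name `table6` and variable names are chosen here for convenience.

```python
from statistics import mean

# Average precision values for Q1..Q15, copied from Table 6.
table6 = {
    "Google search": [0.69, 0.55, 0.7, 0.44, 0.6, 0.58, 0.46, 0.73,
                      0.6, 0.55, 0.7, 0.57, 0.58, 0.52, 0.53],
    "Search without classification": [0.73, 0.6, 0.77, 0.53, 0.7, 0.71, 0.53, 0.8,
                                      0.71, 0.72, 0.77, 0.65, 0.72, 0.55, 0.67],
    "Classification-based search": [0.86, 0.91, 0.85, 0.98, 0.93, 0.95, 0.96, 0.94,
                                    0.89, 0.84, 0.81, 0.86, 0.91, 0.91, 0.82],
}

# Equation 10: MAP is the mean of the per-query average precision (N = 15).
map_scores = {name: round(mean(values), 2) for name, values in table6.items()}
print(map_scores)
# {'Google search': 0.59, 'Search without classification': 0.68,
#  'Classification-based search': 0.89}
```

The column means round to 0.59, 0.68 and 0.89, matching the MAP scores stated in the text.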

Keyword matching is at the heart of both Google search and Search without classification. In most cases, Google retrieves the entire Thirukkural chapter that matches the keywords in the query and ignores semantically related couplets. For example, if a query contains the word “Natpu (Friendship)”, Google returns the chapter on “Natpu (Friendship)” as well as the chapter on “Tīnatpu (Evil Friendship)”, which explains its low MAP score. Search without classification retrieves the Thirukkural couplets that match the keywords in the query and their synonyms, but misses related words that are not synonyms. For instance, if a query contains the term “Aracan (King)”, Search without classification retrieves the Thirukkural couplets matching the keyword “Aracan (King)” and its synonyms “Mannan (King)”, “Vēntaṉ (King)” and “Maṉṉavaṉ (King)”. It does not retrieve Thirukkural couplets matching words related to “Aracan (King)”, such as “Aran (Bulwark)”, “Ceṅkōl (Scepter)” and “Patai (Army)”. These couplets are retrieved in classification-based search because these words have higher probability values for the superclass king.
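The contrast between the two locally built searches can be sketched with a toy example. Everything below is hypothetical and only illustrative: the couplet identifiers, the synonym table, the word probabilities per superclass and the threshold are made up, and the selection rule is a simplification rather than the retrieval procedure of the proposed system.

```python
# Hypothetical synonym table for the keyword-based search.
SYNONYMS = {"Aracan": {"Mannan", "Vēntaṉ", "Maṉṉavaṉ"}}

# Hypothetical P(word | superclass) values, standing in for what a trained
# MNB classifier would provide.
WORD_PROB = {
    "king": {"Aracan": 0.05, "Mannan": 0.04, "Aran": 0.03,
             "Ceṅkōl": 0.03, "Patai": 0.02, "Natpu": 0.001},
    "friend": {"Natpu": 0.06, "Aracan": 0.001},
}

# Hypothetical couplets, each reduced to the set of root words it contains.
COUPLETS = {
    "c1": {"Aracan"},
    "c2": {"Mannan"},
    "c3": {"Aran"},
    "c4": {"Ceṅkōl", "Patai"},
    "c5": {"Natpu"},
}


def synonym_search(query_word):
    """Search without classification: match the keyword and its synonyms only."""
    terms = {query_word} | SYNONYMS.get(query_word, set())
    return sorted(cid for cid, words in COUPLETS.items() if words & terms)


def classification_search(query_word, threshold=0.01):
    """Pick the superclass where the query word is most probable, then match
    every word whose probability for that superclass reaches the threshold."""
    superclass = max(WORD_PROB, key=lambda c: WORD_PROB[c].get(query_word, 0.0))
    related = {w for w, p in WORD_PROB[superclass].items() if p >= threshold}
    return sorted(cid for cid, words in COUPLETS.items() if words & related)


print(synonym_search("Aracan"))         # ['c1', 'c2']
print(classification_search("Aracan"))  # ['c1', 'c2', 'c3', 'c4']
```

In this toy setting the synonym-based search stops at the couplets containing the keyword or a synonym, whereas the superclass-based lookup also reaches the couplets containing Aran, Ceṅkōl and Patai.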

6 Conclusions and future work

This paper has attempted to reorganize the Tamil classic Thirukkural by proposing a new set of ten superclasses for building an efficient search system. The proposed approach uses the MNB classifier to classify Thirukkural couplets into the ten superclasses. Each superclass is further divided into two subclasses that capture the didactic essence of the Thirukkural.

The efficiency of the proposed classification framework is tested by building an IR system on top of it. The performance of the proposed IR system has been evaluated using the MAP score and compared with Search without classification and Google search. The proposed system achieved a MAP score of 0.89, compared with 0.68 for Search without classification and 0.59 for Google search, and the improvement was largely driven by the classification framework. To justify the choice of the MNB classifier, the MNB algorithm is compared with the SVM, LR and RF algorithms.

The proposed approach can be extended to other unexplored Tamil classics such as the Kurunthogai, Purananooru, and Naladiyar by finding an apt semantic representation for these works.

References

1. Al-Badarneh, Al-Shawakfa, Bani-Ismail, Al-Rababah and Shatnawi, The impact of indexing approaches on Arabic text classification, Journal of Information Science 43(2) (2017), 159–173.

2. Al-Salemi and Aziz M.J., Statistical bayesian learning for automatic Arabic text categorization, Journal of Computer Science 7(1) (2011), 39.

3. Anandan, Saravanan, Parthasarathi and Geetha T.V., Morphological analyzer for Tamil, In International Conference on Natural language Processing (2002).

4. Anita and Subalalitha C.N., An Approach to Cluster Tamil Literatures Using Discourse Connectives, In 2019 IEEE 1st International Conference on Energy, Systems and Information Processing (ICESIP) 2019 Jul 4 (pp. 1–4). IEEE.

5. Anita and Subalalitha C.N., Building Discourse Parser for Thirukkural, In Proceedings of the 16th International Conference on Natural Language Processing (ICON-2019) IIIT Hyderabad, India: NLP Association of India, (2019), 18–25.

6. Bahgat E.M., Rady, Gad and Moawad I.F., Efficient email classification approach based on semantic methods, Ain Shams Engineering Journal 9(4) (2018), 3259–3269.

7. Blackburn, Corruption and redemption: The legend of Valluvar and Tamil literary history, Modern Asian Studies 34(2) (2000), 449–482.

8. Capdevila and Florez O.W., A communication perspective on automatic text categorization, IEEE Transactions on Knowledge and Data Engineering 21(7) (2009), 1027–1041.

9. Di Nunzio G.M., Using scatterplots to understand and improve probabilistic models for text categorization and retrieval, International Journal of Approximate Reasoning 50(7) (2009), 945–956.

10. Elanchezhiyan, Geetha T.V., Ranjani and Karky, Kuralagam - Concept Relation based Search Engine for Thirukkural, In Tamil Internet Conference, University of Pennsylvania, Philadelphia, USA, (2011), 19–23.

11. Gwet K.L., Computing inter-rater reliability and its variance in the presence of high agreement, British Journal of Mathematical and Statistical Psychology 61(1) (2008), 29–48.

12. Luo, Efficient english text classification using selected machine learning techniques, Alexandria Engineering Journal 60(3) (2021), 3401–3409.

13. Madhavan K.V., Nagarajan and Sridhar, Rule based classification of tamil poems, International Journal of Information and Education Technology 2(2) (2012), 156.

14. Miao Y.L., Cheng W.F., Ji Y.C., Zhang and Kong Y.L., Aspect-based sentiment analysis in Chinese based on mobile reviews for BiLSTM-CRF, Journal of Intelligent & Fuzzy Systems. (Preprint) 1–1.

15. Rajan, Ramalingam, Ganesan, Palanivel and Palaniappan, Automatic classification of Tamil documents using vector space model and artificial neural network, Expert Systems with Applications 36(8) (2009), 10914–10918.

16. Rizzo, Tomassetti, Vetro, Ardito, Torchiano, Morisio and Troncy, Semantic enrichment for recommendation of primary studies in a systematic literature review, Digital Scholarship in the Humanities 32(1) (2017), 195–208.

17. Sridevi and Subashini, Combining Zernike moments with regional features for classification of handwritten ancient tamil scripts using extreme learning machine, In 2013 IEEE International Conference ON Emerging Trends in Computing, Communication and Nanotechnology (ICECCN) 2013 Mar 25 (pp. 158–162). IEEE.

18. Srinivasan and Subalalitha C.N., A Thesaurus Based Semantic Relation Extraction for Agricultural Corpora, In International Conference on Computational Intelligence in Data Science 2020 Feb 20 (pp. 99–111). Springer, Cham.

19. Subalalitha C.N., Information extraction framework for Kurunthogai, Sādhanā 44(7) (2019), 1–6.

20. Subalalitha C.N. and Poovammal, Automatic bilingual dictionary construction for Tirukural, Applied Artificial Intelligence 32(6) (2018), 558–567.

21. Subalalitha C.N. and Ranjani, A unique indexing technique for discourse structures, Journal of Intelligent Systems 23(3) (2014), 231–243.

22. Tulbure A.A., Tulbure A.A. and Dulf E.H., A review on modern defect detection models using DCNNs–Deep convolutional neural networks, Journal of Advanced Research (2021).

23. Bayesian Naïve Bayes classifiers to text classification, Journal of Information Science 44(1) (2018), 48–59.
