Abstract
The rapid expansion of online text brought about by the Internet has revived and intensified interest in sorting, managing and categorising documents into their respective domains. This creates a pressing need for an automatic text categorization system that assigns a document to its appropriate domain. This article demonstrates the effectiveness of a hybrid approach that combines text-based and graph-based features. The hybrid approach was applied to 14,373 Bangla articles with 57,22,569 tokens collected from various online news corpora covering nine categories. The article also presents the individual application of each feature to explain how it works. For classification, the feature sets were passed to Bayesian classification methods, which yielded 98.73% accuracy for Naïve Bayes Multinomial (NBM). To test the robustness and language independence of the system, the experiments were also performed on two popular English datasets.
1. Introduction
Text classification (TC) is evolving rapidly and is one of the major research fields in Natural Language Processing (NLP). It is widely studied in the current era, when large amounts of data are available online in many languages. TC is a supervised categorization process in which text documents are automatically assigned to predefined categories. Manually indexing or sorting a large and growing volume of data is impractical, so there is a strong need for automatic text classification systems. TC can be of various types: binary (classifying a text into one of two domains), multi-class (classifying a text into exactly one domain when the set contains more than two domains) and multi-label (classifying a text into more than one domain). It plays a prime role in a wide array of areas, including text information retrieval [1], sentiment analysis [2], word sense disambiguation [3] and question-answering systems [4], as well as other applications involving text document organisation. Categorization often depends on the domain for which the categories are predefined, and relationships are identified based on the terms that the text documents contain.
Within the past several years, the Internet has come to dominate the landscape of computing, and its exponential growth has led to a massive increase in digital text documents. There is a pressing need for efficient tools and software to help users search for useful information. This has paved the way for text classification, which assigns documents to their respective categories. In essence, text document classification, or text categorization, is one of the core technologies with a strong impact on the development of language engineering and strong application relevance in the commercial world. It is associated with many challenges relating to both the text and the content of the documents. First, it is difficult to capture the semantics and abstract concepts of natural language texts from a small set of keywords. Second, the high dimensionality and variable length of text documents, along with their varied content and quality, hinder the efficiency and accuracy of an automatic text classification system. Although a considerable amount of work exists in the literature for English and other well-resourced languages, the same is not true for Indian languages, especially Bangla. One reason is the inherent complexity of Bangla, compounded by the unavailability of standard datasets. Bangla, a language of the Indo-European family, is the national language of Bangladesh and the state language of West Bengal, India. It is written in the Bangla script, which is also used for Assamese, another language of India. According to the information available on the Internet, it is the sixth most spoken language worldwide (by population) and the third most spoken language in India [5]. This creates a pressing need to organise Bangla text documents effectively so that users can easily find required or related documents.
The automatic Bangla text categorization system that the present work aims to develop will help move the language a little further towards digitisation and will address the urgent needs of Bangla text users who are waiting for a reliable system for retrieving useful information in this language.
2. Our contributions
We proposed a tool that can classify Bangla text documents. The major contributions of the work are summarised as follows.
Bangla text document classification has not been well studied (Section 3). In the proposed technique, we combine two kinds of features (text-based and graph-based) for feature extraction. Furthermore, we implement Naïve Bayes Multinomial, an extended version of the standard Naïve Bayes algorithm, which has been widely studied for the classification of English text documents but not for Bangla.
Since there were no publicly available datasets in Bangla, we developed our own dataset from several online news corpora (see Section 4.1). The dataset will be provided for research purposes upon request.
An automatic Bangla text categorization tool can address the need of Bangla text users to retrieve useful information from digital Bangla text documents.
The remainder of this article is organised as follows: Section 3 reviews related work on text classification; Section 4 describes the proposed methodology; Section 5 presents the results achieved with the chosen classifier, along with the statistical significance test and a comparison with existing methods; and Section 6 draws conclusions and outlines future work.
3. Related works
TC has been researched extensively for several well-resourced languages, especially English, Chinese and Arabic, but very few works exist for Indian languages, especially Bangla. Some recent works are discussed here. Parlak and Uysal [6] performed an exhaustive comparison for categorising abstracts obtained from Turkish and English medical journals. They extracted features using three methods: unigram, bigram and hybrid (a combination of unigram and bigram). They then used three distinct feature selection approaches and seven classification algorithms. The experimental results show that the unigram approach with the distinguishing feature selector (DFS) and the Multinomial Naïve Bayes (MNB) classifier performed better on both datasets. Lee et al. [7] proposed an approach capable of handling the thematic flow of texts with Deep Neural Networks (DNNs), in which the converse knowledge and disperse presentations of texts are integrated using Rhetorical Structure Theory (RST)-based discourse analysis. They evaluated it on two document-level text classification tasks, sentiment analysis and sarcasm detection. Malliaros and Skianis [8] introduced a graph-based model in which each text is represented by a graph that depicts the relationships among its words. Their system captures the relationships among words that co-occur in a text and builds a feature set from them. The significance of a word in a text was determined using a graph-theoretic vertex centrality measure. They ran their experiments on standard datasets using several classification algorithms. Rousseau et al. [9] represented documents as graphs-of-terms and extracted relevant features corresponding to long n-grams using iterated sub-graph mining. Based on the idea of the k-core, they reduced each graph to its main core to generate the final feature vector. Experiments on four standard datasets showed better performance than state-of-the-art methods. Li et al. [10] used a combined feature pruning technique to reduce the feature set, based on a preliminary investigation of existing works. A neural architecture was then trained and used to categorise 9804 Chinese documents from 20 domains. Luo et al. [11] used n-gram features, where n is one (unigram), two (bigram) or a combination of the two, for feature extraction. They further proposed a method based on feature extraction and reduction approaches to enhance the performance of their system. Wu et al. [12] proposed a hybrid system that generates a lexicon of a fixed dimension and then represents each term using 1-of-m encoding. A convolutional neural network (CNN) was then used to obtain the morphological features of the character vectors of each word, and the semantic features of the word vectors were obtained through training on large-scale text material. Finally, text classification was carried out with the SVM multiple classifier. Al-Tahrawi [13] used the Alj-News corpus, which includes 1500 Arabic news documents uniformly partitioned into 5 distinct classes; in each class, 240 documents were used for training and 60 for testing. The author used the chi-square method for the categorization task. Ali and Ijaz [14] compared various statistical measures for categorising Urdu documents using NB and SVM. They applied language-dependent pre-processing techniques to develop the final feature vector. Their experiment shows that performance increases as the value of n increases up to 3 and decreases when n is increased further from 3 to 4.
The lack of research on TC for Indian languages, especially Bangla, shows that there is a pressing need for more work in this domain. Gupta and Gupta [15] proposed a hybrid classification algorithm combining NB and ontology-based classification techniques. They applied their approach to 184 Punjabi news articles from 7 sub-classes of the sports category, of which 50 text documents were used for training and the remainder for testing. Their study showed that the hybrid classification performed better than the other approaches. ArunaDevi and Saveeth [16] proposed a C-feature extraction model containing a pair of terms for Tamil text classification. Patil and Bogiri [17] presented an automatic text categorization technique for Marathi documents based on the user’s profile. Their system uses the LINGO approach with the vector space model and was tested on 200 articles from 20 domains; the experiment shows the efficiency of the LINGO clustering algorithm for the classification of Marathi text documents. Patil and Game [18] applied four classifiers, among which NB proved to be the most efficient with respect to time and accuracy. Islam et al. [19] used tf-idf and SVM and obtained 92.57% accuracy for the categorization of articles from 12 domains. Alam and Islam [20] used tf-idf and a neural network for the categorization of Bangla text documents and obtained a precision of 0.96. Hassan et al. [21] proposed a text classification system based on a term-graph model for categorising texts from 12 domains and obtained an average accuracy of 90.99% using the KNN algorithm.
4. Proposed methodology
This article aims to show that the combination of text-based and graph-based features performs better than the individual application of either feature in classifying Bangla text documents collected from various web news corpora. The motive behind proposing these two features is to explore their use in text classification for Bangla, for which very few techniques are available. The outline of the methodology is shown in Figure 1, which presents the stages adopted for the categorization of Bangla news articles; detailed descriptions are given in the following sections. A comparison of several classification algorithms is also provided to analyse the system’s performance.

Figure 1. The outline of the proposed methodology.
4.1. Data collection
Data are the most crucial part of any experiment, and their collection needs care so that errors do not propagate into later processing. The literature study shows that, apart from one or two Bangla corpora that are not adequate for experimentation, no domain-based Bangla text data are publicly available. We therefore had to build our own dataset. A total of 14,373 texts from nine domains, as defined by leading newspapers, were obtained from various online Bangla news corpora, webpages and magazines using a web crawler tool (import.io). The associated images, links and other graphic content were discarded; only the text portions were retained, saved in UTF-8 format and carried forward for further processing. The links used for extracting the Bangla text documents are provided in Dhar et al. [22], and the distribution of text documents across the nine domains is given in Table 1.
Table 1. Distribution of text documents among the domains.
In real-world scenarios, documents are often only partially available, that is, a portion of the text is missing. A text categorization system should be able to handle such documents. To test the performance of our system in this scenario, four datasets D1, D2, D3 and D4 were generated from the original data. Each file was divided into two halves. D1 was composed of the first half of each file and therefore represents a scenario where text is missing from the end. D2 was composed of the last half of each file and represents a scenario where text is missing from the beginning. D3 was composed of the first half of the files from D1 and the last half of the files from D2. D4 was composed of the first half of the files from D2 and the last half of the files from D1. D3 and D4 test the system in a scenario where some files are missing information at the start and others are missing information at the end. The partition of the dataset is depicted as a block diagram in Figure 2.
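The construction of D1 to D4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' tooling: the function names are hypothetical, and splitting each file at the midpoint of its whitespace tokens is an assumption, since the text does not say whether halves were measured in characters, tokens or sentences.

```python
def split_halves(text):
    """Split one document into its first and last half (by whitespace tokens)."""
    tokens = text.split()
    mid = len(tokens) // 2  # assumption: halves are measured in tokens
    return " ".join(tokens[:mid]), " ".join(tokens[mid:])


def build_partitions(documents):
    """Build the four incomplete-data sets D1-D4 from full-text documents."""
    halves = [split_halves(doc) for doc in documents]
    d1 = [first for first, _ in halves]  # first half of every file
    d2 = [last for _, last in halves]    # last half of every file
    cut = len(documents) // 2
    d3 = d1[:cut] + d2[cut:]  # first half of the files from D1, the rest from D2
    d4 = d2[:cut] + d1[cut:]  # first half of the files from D2, the rest from D1
    return d1, d2, d3, d4


docs = ["a b c d", "e f g h", "i j k l", "m n o p"]
d1, d2, d3, d4 = build_partitions(docs)
```

With the four toy documents above, D3 holds two first-half files followed by two last-half files, and D4 holds the complementary mix.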

Figure 2. Flowchart of the partitioned data.
4.2. Pre-processing
Digital texts comprise a series of characters, words and phrases. Before further processing, the sentences of the raw articles need to be broken into linguistic units; this task is known as ‘tokenization’ and each unit is a ‘token’. Without proper segregation of tokens, further analysis is impossible. The sentences were split using ‘space’ as the delimiter, which yielded 57,22,569 tokens in total. Since the token set contains both indispensable and dispensable tokens, stopword removal is needed to clean and filter the data. Tokens that do not directly contribute to assigning a text document to its category are termed ‘stopwords’, and they are problem-specific. The list of stopwords used here is provided in Stopwords [23]. After pre-processing, 44,47,689 tokens remained. However, the selection of stopwords requires special care, because in some cases a stopword may be a prime source of information, and removing it can harm the experiment. This makes stopword selection a challenging task in any experiment with text data.
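The two pre-processing steps described above, whitespace tokenization followed by stopword removal, can be sketched as follows. The function names are hypothetical, and the one-word stopword set is purely illustrative; it is not the list from [23]. Note that Python's `str.split()` splits on any run of whitespace, which is slightly broader than a single space character.

```python
def tokenize(text):
    """Break raw text into tokens, using whitespace as the delimiter."""
    return text.split()


def remove_stopwords(tokens, stopwords):
    """Drop tokens that appear in the (problem-specific) stopword set."""
    return [token for token in tokens if token not in stopwords]


# Illustrative input only; the stopword set here is NOT the list from [23].
sample_text = "ভারত এবং বাংলাদেশ ক্রিকেট খেলছে"
sample_stopwords = {"এবং"}

tokens = tokenize(sample_text)
filtered = remove_stopwords(tokens, sample_stopwords)
print(len(tokens), len(filtered))
```

Keeping the stopword set as a parameter reflects the point made in the text that stopwords are problem-specific and need to be chosen with care.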
4.3. Hybrid Feature
Approaches for generating an appropriate feature set can be grouped into wrapper and filter approaches, as stated in Karegowda et al. [24]. In the wrapper approach, feature selection depends on the classifier used in the experiment, because the classification method itself is used to measure the importance of a feature set. In the filter approach, feature selection does not depend on training a model and relies instead on fast computation. Once feature selection has been done with a filter method, the feature set can be passed to any classification algorithm the experiment requires. The choice of feature selection approach has a great influence on the performance of text classification. Here, the text-based feature, the graph-based feature and the combination of the two were used as filter approaches, and the feature sets they generate are the input to the classifier. The three feature selection methods used in the experiment are briefly described in the following.
4.3.1. Text-based feature
In the present work, a new text-based feature (token probability estimation) was proposed and used to assign weights to each of the tokens. This token probability estimation (TPE) estimates the term distribution in a document. In our pilot study, it has been seen that the standard weighting schemes
where
where
4.3.2. Graph-based feature
Standard text representation-based features are unable to deal with the inversion of tokens and their equivalent subsets, such as ‘article about news’ versus ‘news article’. Several researchers believe that graph-based techniques can overcome this problem, and these techniques are therefore gaining a lot of attention. Various techniques have been introduced in other fields [26], but they require attention in text categorization tasks for different languages. This article therefore explores the usefulness of the graph-based feature for the Bangla text categorization task.
In this experiment, a graph-based feature was proposed that uses a weighted graph algorithm described as follows:
A graph Gp can be defined by a 3-tuple: Gp = (Vt, Eg, Wg), where Vt is the set of vertices, Eg is the set of edges connecting the vertices in Vt and Wg is the set of weights assigned to the edges in Eg.
Node: Unique words being extracted from the articles in the dataset.
Edge: Developed using the frequency of tokens in the articles. If two words occur in the same article, an edge is assigned between those words.
Weight Vector: It is defined by equation (5), where Wg(s, t) denotes the weight assigned to the edge between token s and token t, f(s, t) denotes the presence of tokens s and t together within an article, and f(s) and f(t) denote the presence of token s and token t individually in that article. A higher weight Wg(s, t) represents a stronger edge between the two vertices.
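The node, edge and weight definitions above can be illustrated with a small sketch. Two points are assumptions rather than the article's method: equation (5) is not reproduced in this text, so a Jaccard-style ratio f(s, t) / (f(s) + f(t) − f(s, t)) is used purely as a stand-in, and presence is counted across articles rather than within one. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_term_graph(articles):
    """Build (vertices, edge weights) from a list of tokenised articles."""
    presence = Counter()     # f(s): number of articles that contain token s
    co_presence = Counter()  # f(s, t): number of articles that contain both
    for tokens in articles:
        unique = sorted(set(tokens))  # nodes are unique words
        presence.update(unique)
        # An edge links every pair of words occurring in the same article.
        co_presence.update(combinations(unique, 2))
    weights = {}
    for (s, t), f_st in co_presence.items():
        # ASSUMPTION: Jaccard-style weight standing in for equation (5).
        weights[(s, t)] = f_st / (presence[s] + presence[t] - f_st)
    return set(presence), weights


articles = [
    ["news", "article", "sports"],
    ["news", "article"],
    ["sports", "match"],
]
vertices, weights = build_term_graph(articles)
```

Because the edge between two words does not depend on their order in the text, ‘article about news’ and ‘news article’ contribute the same edge, which is the property the graph-based feature is meant to capture.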
4.4. Bayesian Classifier
In the present experiment, the Naïve Bayes Multinomial (NBM) classification algorithm was used to train the model on the selected feature values for classifying the text documents. The motivation for choosing this classifier is its promising performance in earlier text classification work by Kim et al. [27], Kibriya et al. [28], Rehman et al. [29] and Wilbur and Kim [30]. NBM has been widely used for English but has not been explored for Bangla text documents, so it is worth studying how it performs in classifying Bangla text documents into their respective categories. The Naïve Bayes Multinomial is a specialised version of Naïve Bayes designed with the nature of text documents in mind. It computes the likelihood of an article belonging to a given domain using equation (6) where
NBM takes token counts into account and adjusts the underlying assumptions accordingly. This extended form is suitable when a token may occur multiple times in a document. This variation estimates the conditional likelihood
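As an illustration of how a multinomial Naïve Bayes classifier uses token counts, the sketch below scores a document by adding the log prior of a class to the log likelihood of each token occurrence. It is the textbook formulation, not the WEKA implementation used in the experiments, and it does not reproduce equation (6). The class name, the add-one (Laplace) smoothing parameter `alpha` and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Textbook multinomial Naive Bayes over tokenised documents."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha  # assumption: add-one (Laplace) smoothing
        self.n_docs = len(docs)
        self.vocab = {token for doc in docs for token in doc}
        self.class_docs = Counter(labels)        # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        return self

    def log_score(self, doc, label):
        counts = self.token_counts[label]
        total = sum(counts.values()) + self.alpha * len(self.vocab)
        score = math.log(self.class_docs[label] / self.n_docs)  # log prior
        for token in doc:  # repeated tokens contribute repeatedly
            if token in self.vocab:
                score += math.log((counts[token] + self.alpha) / total)
        return score

    def predict(self, doc):
        return max(self.class_docs, key=lambda c: self.log_score(doc, c))


train_docs = [["goal", "match", "goal"], ["match", "team"],
              ["vaccine", "doctor"], ["doctor", "hospital", "vaccine"]]
train_labels = ["sports", "sports", "medical", "medical"]
model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["goal", "team", "match"]))
```

Working in log space avoids numerical underflow when documents contain many tokens, which matters for news articles of the length used here.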
5. Result and analysis
For this experiment, WEKA [31], a popular open-source classification tool, was used to apply the classifier. K-fold cross-validation was applied to the articles, with K chosen to be 5. Cross-validation evaluates predictive models by dividing the original sample into a training set and a test set used for evaluation. In K-fold cross-validation, the original dataset is split into K groups of equal size, of which one group is held out for validation and the remaining groups form the training set. This process is repeated K times, with each of the K groups used for validation exactly once. The benefit of this scheme is that every document is used for both training and validation.
Using the text-based feature and the NBM algorithm, we obtained an accuracy of 96.62% with five-fold cross-validation. Different numbers of folds (5, 10, 15 and 20) were then analysed; the results are provided in Table 2, which shows that the maximum accuracy was obtained for five-fold cross-validation.
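The K-fold procedure described above can be sketched as an index generator. This is a simplified illustration with a hypothetical function name: it uses contiguous folds without shuffling or stratification, which a tool such as WEKA may handle differently.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# 14,373 documents and K = 5, as in the experiments.
folds = list(k_fold_indices(14373, 5))
print([len(validation) for _, validation in folds])
```

Each document index appears in exactly one validation fold and in the training set of the other four folds.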
Table 2. Accuracy for different cross-validation folds using text-based feature.
Bold: Highest value.
The domain-wise accuracy obtained by text-based feature for five-fold cross-validation is depicted through Figure 3.

Figure 3. Domain-wise accuracy obtained by text-based feature using NBM.
Next, we experimented with the graph-based feature and the NBM algorithm using five-fold cross-validation and achieved an accuracy of 97.86%. Different numbers of folds (5, 10, 15 and 20) were again analysed for the graph-based feature; the results are provided in Table 3, from which it can be observed that the maximum accuracy was obtained for 10-fold cross-validation.
Table 3. Accuracy for different cross-validation folds using graph-based feature and NBM.
NBM: Naïve Bayes Multinomial.
Bold: Highest value.
The domain-wise accuracy obtained by the graph-based feature using 10-fold cross-validation is depicted through Figure 4.

Figure 4. Domain-wise accuracy obtained by graph-based feature using NBM.
Finally, the two features (text-based and graph-based) were combined and tested with the NBM algorithm using five-fold cross-validation, which gave an accuracy of 98.73%. Different numbers of folds (5, 10, 15 and 20) were analysed for the combined feature; the results are provided in Table 4, from which it can be observed that the maximum accuracy was obtained for five-fold cross-validation. The experiments were also repeated for different Bayesian classifiers: Naïve Bayes Multinomial (NBM), Naïve Bayes Multivariate (NBMv) and Naïve Bayes (NB). The accuracies obtained are provided in Table 4, which shows that the extended form (NBM) is much more effective for the categorization of Bangla news articles than the other two Bayesian classification algorithms. The experiments were also conducted on two standard English datasets, Reuters-21578 and 20 Newsgroups, using the proposed method with the default setting (five-fold cross-validation), and average accuracies of 90.27% and 88.52%, respectively, were obtained.
Table 4. Accuracy for different Bayesian classifiers.
NBM: Naïve Bayes Multinomial; NB: Naïve Bayes.
Bold: Highest value.
The experiments were also performed with fewer than five cross-validation folds (n being 2, 3 and 4) to see how the system performs in these scenarios and to check its robustness. The results are provided in Table 5.
Table 5. Accuracy for n cross-validation folds with n being 2, 3 and 4 using hybrid feature and NBM.
NBM: Naïve Bayes Multinomial.
The recognition accuracy is obtained based on the equation provided in the following
The accuracy achieved for text-based feature, graph-based feature and the combination of these two features using NBM is provided in Figure 5 which shows that the combined features performed better than their individual application for classifying Bangla text documents.

Figure 5. Obtained accuracies for three approaches.
The confusion matrix obtained with NBM and five-fold cross-validation, based on the hybrid feature extraction approach, on the 14,373 Bangla text documents is shown in Table 6. It can be seen that a few text documents were misclassified into categories other than the one they belong to. One reason is the nature of the contents of the text documents: for a few domains, the contents are very similar to those of other domains, which leads to misclassification. The errors were analysed, and part of a text document is provided as an example in Figure 6. Bangla is one of the most morphologically rich languages. Consider the ‘Science & Technology (ST)’ domain: the confusion matrix shows that, of the 1194 text documents in this category, 1170 were classified correctly, while 9 documents were assigned to the ‘Business (B)’ domain, 12 to the ‘Medical (M)’ domain and 3 to the ‘State affairs (P)’ domain. The sample text document in Figure 6 originally belongs to the ‘Science & Technology’ domain, but the number of its tokens related to the ‘Medical’ domain is quite high, and hence the document was misclassified into the ‘Medical’ category. After pre-processing, 39 of its 79 terms (shown in red in Figure 6) are medical-related words, that is, around 50% of the tokens remaining after pre-processing, which cannot be neglected in the experiment.
The rate and percentage of misclassification obtained using the hybrid feature selection approach for all the text categories are provided in Table 7. The Business and State affairs domains have high misclassification rates because their contents are closely related to each other.
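The per-domain figures can be checked directly from one row of the confusion matrix. The sketch below uses the ‘Science & Technology’ counts quoted in the text (1170 correct, 9 to Business, 12 to Medical, 3 to State affairs); the function name is hypothetical, and the remaining cells of the row must be zero because these four counts already sum to 1194.

```python
def row_metrics(row, true_label):
    """Return (accuracy %, misclassification %) for one confusion-matrix row."""
    total = sum(row.values())
    correct = row[true_label]
    accuracy = 100.0 * correct / total
    return accuracy, 100.0 - accuracy


# Counts for the 'Science & Technology' row, as quoted in the text.
st_row = {"ST": 1170, "B": 9, "M": 12, "P": 3}
accuracy, miss_rate = row_metrics(st_row, "ST")
print(f"{accuracy:.2f}% correct, {miss_rate:.2f}% misclassified")
# prints "97.99% correct, 2.01% misclassified"
```

The same function applies to any other row of Table 6 once its counts are supplied.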
Table 6. Confusion matrix obtained using NBM based on hybrid feature.
NBM: Naïve Bayes Multinomial; SP: sports; ST: science and technology.
Bold: Highest value.

Figure 6. Example of misclassified text document.
Table 7. Rate of misclassification of text documents obtained using hybrid feature and NBM.
NBM: Naïve Bayes Multinomial.
Table 8. Result obtained for various metrics using hybrid feature and NBM.
NBM: Naïve Bayes Multinomial; SP: sports; ST: science and technology.
The experiment was also performed with a deep learning-based approach using LSTM-RNN, which is suitable for sequential data, and an accuracy of 95.43% was obtained using the hybrid feature. We also plan to use a deep learning-based approach in the feature extraction phase.
To test the robustness of our system, we experimented with different scenarios involving incomplete data. The four scenarios are detailed in Section 4.1, and the results obtained using NBM with the hybrid feature and five-fold cross-validation are presented in Table 9.
Table 9. Results obtained for four different scenarios involving incomplete data.
We further tested our system in a scenario where it is trained on the original dataset consisting of full texts but tested on incomplete texts, and observed the performance for the different incomplete test sets.
Table 10. Result obtained by the four datasets for the train-test split.
5.1. Performances of popular classifiers
Experiments were also conducted with classifiers other than NBM, namely, Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Decision Tree (J48) and K Nearest Neighbour (KNN). The performance of the classifiers on different evaluation metrics is shown in Table 11.
Table 11. Performances of classifiers for different evaluation metrics.
NBM: Naïve Bayes Multinomial; SGD: Stochastic Gradient Descent; SVM: Support Vector Machine; J48: Decision Tree; KNN: K Nearest Neighbour.
Bold: Highest value.
5.2. Statistical significance test
The classification algorithms were compared using the Friedman rank sum test [32], which was selected because it does not rely on the methodology proposed for the specific task. The dataset was divided into n sets and, for each set, the accuracy was obtained using each of the k commonly used classifiers chosen for this experiment. Based on the accuracies achieved by a classification algorithm over all the subsets, its mean rank was calculated to identify the top-ranked classifier. In this experiment, k and n were 6 and 5, which means the dataset was divided into five sets. The accuracy obtained by the classification algorithms on each of the sets was measured and rank
Table 12. Accuracies and ranks for the classifiers.
NBM: Naïve Bayes Multinomial; SGD: Stochastic Gradient Descent; NB: Naïve Bayes; SVM: Support Vector Machine; J48: Decision Tree; KNN: K Nearest Neighbour.
The Friedman Statistic
Table 13. Friedman test statistics.
This test confirms that NBM performs better than all the other classification algorithms: its mean rank of 1.00 (Table 12) is the highest rank, while the lowest rank (6.00) is obtained by NB.
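The ranking procedure behind the Friedman test can be sketched as follows. The article's own statistic is not reproduced in this text, so the code uses the standard Friedman chi-square form, χ² = 12n / (k(k + 1)) · (Σ R_j² − k(k + 1)² / 4), where R_j is the mean rank of classifier j; treating this as the form used by the authors is an assumption. The function names and the accuracy table are hypothetical and are not the values from Table 12.

```python
def rank_row(accuracies):
    """Rank classifiers on one subset: rank 1 = highest accuracy, ties averaged."""
    order = sorted(range(len(accuracies)), key=lambda j: -accuracies[j])
    ranks = [0.0] * len(accuracies)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and accuracies[order[j + 1]] == accuracies[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # average of the tied positions (1-based)
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return ranks


def friedman_chi_square(table):
    """Return (mean ranks, Friedman chi-square) for an n x k accuracy table."""
    n, k = len(table), len(table[0])
    rank_rows = [rank_row(row) for row in table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return mean_ranks, chi2


# HYPOTHETICAL accuracies (%): n = 5 subsets (rows) x k = 6 classifiers (columns).
table = [[98.0 - d, 96.0 - d, 95.0 - d, 93.0 - d, 92.0 - d, 90.0 - d]
         for d in (0.0, 0.2, 0.4, 0.1, 0.3)]
mean_ranks, chi2 = friedman_chi_square(table)
```

In this toy table the classifiers are ordered identically on every subset, so the first column receives a mean rank of 1.0 and the last a mean rank of 6.0, mirroring the extreme ranks reported above.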
After rejecting the null hypothesis, the Nemenyi test [32] was used for comparing best and worst-performing classifiers depending on their average ranks diverging by at least the critical difference (CD)
For Nemenyi’s test, the value of q0.05 for 6 classification algorithms is 2.850 (see Table 5(a) of [32]), so the CD calculated using equation (11) is 3.372. The calculated CD values for all the classifiers are given in Table 14. The difference between the mean ranks of the best-performing classifier (NBM) and the worst-performing classifier (KNN) exceeds the CD, which shows the difference between the performance of NBM and that of the remaining five classifiers. Hence, the statistical significance tests conclude that the Naïve Bayes Multinomial classification algorithm outperforms all other classifiers.
Table 14. CD for Nemenyi’s test.
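The reported critical difference can be reproduced with a short calculation. Equation (11) is not shown in this text; the sketch assumes the standard Nemenyi form CD = q_α · sqrt(k(k + 1) / (6n)), which yields the reported value of 3.372 for q0.05 = 2.850, k = 6 classifiers and n = 5 sets. The function name is hypothetical.

```python
import math


def nemenyi_critical_difference(q_alpha, k, n):
    """Standard Nemenyi critical difference for k classifiers over n data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


cd = nemenyi_critical_difference(q_alpha=2.850, k=6, n=5)
print(round(cd, 3))  # 3.372

# Gap between the best (1.00) and worst (6.00) mean ranks reported in the text.
rank_gap = 6.00 - 1.00
print(rank_gap > cd)  # True
```

Two classifiers are considered significantly different when their mean ranks differ by at least the CD, which is the case for the best and worst mean ranks here.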
NBM: Naïve Bayes Multinomial; SGD: Stochastic Gradient Descent; NB: Naïve Bayes; SVM: Support Vector Machine; J48: Decision Tree; KNN: K Nearest Neighbour.
5.3. Performance of system on established datasets
5.3.1. Non-Indic language
To test the performance of the system and to demonstrate the language independence of the proposed text categorization system, the experiments were also performed on two popular and widely used datasets: Reuters-21578 R8 (7,674 documents) and 20 Newsgroups (18,828 documents). The performance of our system was compared with the works of Mahabal et al. [33], Tellez et al. [34], Jiang et al. [35] and Ko [36]. The accuracies obtained on the Reuters-21578 (R8) and 20 Newsgroups (20NG) datasets are provided in Table 15. Our system outperforms the existing systems on these two datasets.
Table 15. Comparison of the proposed system with the existing systems on Reuters-21578 R8 and 20 Newsgroups dataset.
NG: newsgroups.
5.3.2. Indic language
Since no Bangla text categorization dataset was available, we had to develop our own. However, in 2018, Alam and Islam [20] made their dataset publicly available; it consists of 3,76,226 documents with 8,81,34,695 tokens from the state (St), sports (Sp), economy (Ec), entertainment (En) and international (In) domains. They used tf-idf and a neural network approach for the categorization of documents. The proposed system was therefore tested on this dataset as well, and the precision improved from 0.960 to 0.977. The confusion matrix is provided in Table 16.
Table 16. Confusion matrix obtained on the existing Bangla dataset.
Bold: Highest value.
5.4. Performance of reported systems on our dataset
The proposed approach was also compared with other text classification research, taking into account the feature selection approach and the classification algorithm adopted in each work, with the aim of classifying the text documents holistically as in the approach proposed here. We replicated the frameworks proposed by other researchers on our dataset and evaluated the results. In each case, the classifier for which the authors obtained their maximum accuracy was selected for comparison. Islam et al. [19] used the tf-idf weighting scheme to categorise documents with SVM and obtained an accuracy of 92.57%. Alam and Islam [20] used tf-idf for feature extraction and obtained 96% accuracy with a neural network as the classifier. Hassan et al. [21] proposed a text classification system based on a term-graph model for categorising texts from 12 domains and obtained an average accuracy of 90.99% using the KNN algorithm. The accuracy obtained on our dataset for all these methods is recorded in Table 17. The table shows that the proposed method outperforms the previously implemented techniques in terms of recognition accuracy. The tf-idf-based works rely on the standard tf-idf scheme, whose disadvantage has already been discussed earlier. Also, in the work of Hassan et al. [21], the position of a token was captured while calculating similarity, but the relationship between two tokens based on weighted edges was not considered.
Table 17. Performance of reported systems on our dataset.
SVM: Support Vector Machine; KNN: K Nearest Neighbour.
6. Conclusion and future work
The experiments reported here show that, among the three supervised feature selection methods, namely (a) the text-based feature, (b) the graph-based feature and (c) the hybrid method combining the two, the hybrid feature performed best on 14,373 Bangla news articles from nine domains collected from various online news corpora. This comparative study also shows that, among all the classification algorithms used in the experiment, Naïve Bayes Multinomial outperforms the other commonly used classifiers. In the future, we plan to investigate standard dimensionality reduction schemes such as PCA on the features extracted from texts. We also plan to make our dataset available on request for research purposes only, and to test the system with a larger number of text documents and an increasing number of text domains. More classifiers could be used to test the system’s level of performance so that it can be applied to classifying text documents in other languages. We also plan to incorporate shallow and deep knowledge information and observe the performance of the system, and to explore deep learning methods for developing an automatic Bangla text classification system.
