Tolerance-based granular methods: Foundations and applications in natural language processing

Abstract

Natural Language processing (NLP) derives its roots from artificial intelligence and computational linguistics. The proliferation of large-scale web corpora and social media data as well as advances in machine learning and deep learning have led to practical applications in diverse NLP areas such as machine translation, information extraction, named entity recognition (NER), text summarization and sentiment analysis. Named-entity recognition (NER), is a sub task of information extraction that seeks to discover and categorize specific entities such as nouns or relations in unstructured text. In this paper, we present a review of the foundations three tolerance-based granular computing methods (rough sets, fuzzy-rough sets and near sets) for representing structured (documents) and unstructured (linguistic entities) text. Applications of these methods are presented via semi-supervised and supervised learning algorithms in labelling relational facts from web corpora and sentiment classification (non-topic based text). The performance of the three presented algorithms is discussed in terms of bench marked datasets and algorithms. We make the case that tolerance relations provide an ideal framework for studying the concept of similarity for text-based applications. The aim of our work is to demonstrate that approximation structures viewed through the prism of tolerance have a great deal of fluidity and integrate conceptual structures at different levels of granularity thereby facilitating learning in the presented NLP applications.

Keywords

Granular computing rough sets near sets fuzzy rough sets named entity recognition semi-supervised learning tolerance classes tolerance approximation spaces sentiment classification

1. Introduction

Natural Language processing (NLP) derives its roots from artificial intelligence and computational linguistics. The proliferation of large-scale web corpora and social media data as well as advances in machine learning and deep learning have led to practical applications in diverse NLP areas such as machine translation, machine learning-based information extraction, named entity recognition (NER), text summarization, to name a few. Named-entity recognition (NER), is a sub task of information extraction that seeks to discover and categorize specific entities such as nouns or relations in unstructured text [1]. Opinion mining and sentiment analysis are two popular NLP research areas that where the opinions of users are analyzed to detect sentiment polarity for non topic-based text categorization [2, 3, 4, 5, 6]. Reasoning with a tolerance form of granular methods is particularly useful in natural language applications (NLP) with uncertain and overlapping class boundaries inherent in the structure of linguistic data. Fuzzy set theory [7], rough set theory [8], and near sets [9, 10] are core theories that constitute granular computing. These methods use information granules where a granule is a clump of objects (points) in the universe of discourse drawn together by indistinguishability, similarity, proximity, or functionality [7]. In this paper, we present foundations tolerance-based granular computing methods and its application in: named-entity recognition in structured (documents) and unstructured text (entities) as well as sentiment classification.

Poincaré’s work in 1905 [11] led to the idea of tolerance which can be traced back to his work on the similarity of sensations. Subsequently, tolerance relations were considered by Zeeman [12] in his study of the topology of the brain where he was concerned with different cells of the brain which, when they are very near, are perceived as identical or indistinguishable. Tolerance spaces as a framework for studying the concept of resemblance was presented in [13]. For natural language processing applications, classical rough set theory [8] based on equivalence relations is considered too restrictive. Tolerance relations provide the most general tool for studying indiscernibility phenomena [14]. Authors [15, 16, 14, 17] have worked on combining rough sets and tolerance relations [18] to leading to the tolerance rough sets (TRS) model.

Authors [19, 20, 21], proposed the idea of combining fuzzy and rough sets (FRS) to develop soft similarity classes i.e., fuzzifying the approximations of rough set theory. In fuzzy set theory, the relational counterpart generates soft similarity classes which permits partial overlap, even when the fuzzy relation is reflexive, symmetrical and $\mathcal{T}$ -transitive, i.e., a so called fuzzy $\mathcal{T}$ -equivalence relation [22]; a property that lies at the heart of fuzzy-rough set models which is ideal for text categorization applications [23]. In near set theory, instead of set approximation operators used by tolerance rough sets to generate tolerance classes, the tolerance classes are directly induced from the feature vectors using a tolerance level parameter $\varepsilon$ and a distance function. Tolerance near set-based (TNS) classification algorithm was first introduced in [24].

TRS for clustering structured text (documents) was introduced in [25] where a hierarchical clustering alogrithm based on an extension of the agglomerative method was applied. Non-hierarchical document clustering method using TRS was presented in [26] to address the issue of due to exploding time and space requirements of the hierarchical method especially when dealing with a large document corpora. A similar approach was used by the authors [27] where k-means clustering was applied to web documents represented using the TRS model. Web document classification with TRS was discussed in [28]. Online hot topic detection with TRS based clustering was presented in [29]. A tolerance-based semi-supervised two-class ensemble classifier for documents was introduced in [30]. In [31] document clustering using a lexicon-based document representation based on the TRS model was presented, as well a framework for retrieval of indonesian language text [32]. The problem of learning weights was addressed in [33]. TRS model was applied to semantic document indexing in [34, 35, 36]. An in-depth survey of application of tolerance rough sets in text categorization (structured and unstructured) can be found in [37]. Fuzzy rough set (FRS) method was proposed as a model for structured text representation (document) and retrieval [38]. The application of fuzzy rough sets to query refinement was first presented in [22] with the theory outlined in [39]. More recently, FRS was used in document classification by [40].

The growth of the Web has spawned knowledge-bases from web corpora such as Wikipedia, DBpedia, WordNet,1

¹
https://wordnet.princeton.edu/.

Never Ending learning (NELL) [41] and Google’s Knowledge Vault [42]. A knowledge base usually contains facts that are linguistic (unstructured) and referred to commonly as nouns (unary relations) or relations (binary relations). These facts are learned using in either a semi-supervised or unsupervised manner. The representation of facts is usually in the form of triple (subject, predicate, object). The predicate is referred to as a contextual pattern. With a semi-supervised form of learning which we present in this paper, three challenges are encountered: i) the number of training examples are few, i.e., relations and their known categories, ii) a relation may belong to more than one category depending on its contextual pattern(s), and iii) new relations are frequently mislabelled (also known as concept drift [43]) in the process of iterative learning. In our tolerance-based granular approach with TRS and FRS [44, 45, 46], we address these challenges by creating an approximation space for representing unstructured text in form of either crisp or graded co-occurrence matrix and designing semi-supervised learning algorithms, Tolerant Pattern Learner (TPL) and Fuzzy Pattern Learner (FRL) (Fig. 1 illustrates a typical steps followed during the iterative learning process for linguistic entities). We frame our discussion in terms of benchmarked NELL web corpora as well as two semi-supervised learning algorithms: Coupled Pattern Learner (CPL) [47] and Coupled Bayesian Sets (CBS) [48]. To address the problem of concept drift, both CPL and CBS employ external constraints. In CPL, simultaneous training of many extractors for learning relations using mutual exclusion and type checking relationships were employed. In CBS, a probabilistic score for every candidate noun fact belonging to a category was used. No external constraints were required for the TPL algorithm. However, mutual exclusion constraints were used with the FRL algorithm.

Figure 1.

Semi-supervised Learning in NER with granular methods [49]. Steps 1–5 shows the iterative process of enlarging the trusted entity set (nouns) using a promotion mechanism with approximation spaces.

Non topic-based text classification includes sentiment classification and news categorization. We discuss our recent work on near-set technique-based algorithm (TSC) to classify sentiment polarities of tweets and news classification [50]. The supervised TSC algorithm takes advantage of the recent advances in efficient feature extraction and vector generation from pre-trained bi-directional transformer encoders for creating tolerance classes. Feature vectors are generated using two pre-trained deep learning models BERT [51] and SBERT [52] and cosine distance is used for generating tolerance classes and representative vectors during the training process. The effect of different vector-generation methods, tolerance class sizes, balanced and imbalanced curated datasets as well as a number of sentiment classes on the TSC algorithm are examined.

The outline of the paper is as follows: In Sec. 2, we present formal definitions and models for TRS, FRS and TNS methods for structured text (documents), unstructured text (linguistic entities) and non-topic based text (sentiments). We define approximation spaces as representational structures for applying machine learning algorithms. In Sec. 3 we present two semi supervised learning approaches in the context of an NER application for categorizing nouns (with TPL) and relations (with FRL). In Sec. 4, we discuss a near set-based algorithm (TSC) to classify sentiments of tweets and news categories from text. In Sec. 5, we briefly describe the datasets used in the TPL, FRL and TSC experiments. Results of the experiments using NER and non-topic classification datasets are discussed in Sec. 6 followed by concluding remarks in Sec. 7.

2. Preliminaries

2.1 Tolerance rough sets

In classical rough sets theory, a universe of objects is partitioned into indiscernible classes (i.e., non-overlapping granules) by means of an equivalence relations ( $R$ ) which are reflexive, symmetric and transitive. Specifically, for NLP applications where overlapping classes are desired, a non-transitive binary relation that is reflexive and symmetric is necessary. Hence the need for a tolerance form of rough sets. Tolerance relations are reflexive and symmetric but they are not necessarily transitive, so the classes induced by such relations can overlap. Let $U$ be a finite, non-empty universe of objects, $\mathbb{P}(U)$ be the power set of $U$ , and let $\mathcal{T}:U\to\mathbb{P}(U)$ be a binary relation such that $x\mathcal{T}y\iff y\in\mathcal{T}(x)$ holds for any $x,y\in U$ . $\mathcal{T}$ defines a tolerance relation and $\mathcal{T}(x)$ defines a tolerance class of $x$ . Then a tolerance membership function (also known as vague inclusion function) $\nu:\mathbb{P}(U)\times\mathbb{P}(U)\to[0,1]$ is defined as $\nu(X,Y)=\frac{|X\cap Y|}{|X|}$ . The lower and upper approximations of set $X$ can be defined as

$\displaystyle\mathcal{L}_{\mathcal{A_{\mathcal{T}}}}(X)=\left\{x\in U:\frac{|% \mathcal{T}(x)\cap X|}{|\mathcal{T}(x)|}=1\right\},$ $\displaystyle\mathcal{U}_{\mathcal{A_{\mathcal{T}}}}(X)=\left\{x\in U:\frac{|% \mathcal{T}(x)\cap X|}{|\mathcal{T}(x)|}>0\right\},$

where $\mathcal{A_{\mathcal{T}}}$ defines a simplified form of tolerance approximation space $\mathcal{A}=(U,\mathcal{T},\nu)$ [53] which leads to the the definition of a tolerance rough set:

(Tolerance Rough Set).

A tolerance rough set is defined as a pair $(\mathcal{L}_{\mathcal{A_{\mathcal{T}}}}(X),\mathcal{U}_{\mathcal{A_{\mathcal{% T}}}}(X))$ .

2.1.1 Document representation with TRS

Documents are considered as structured information and were first modeled using tolerance rough sets by Kawasaki [25]. With this model, documents are represented as vectors, where each vector dimension corresponds to a term weight that is enhanced by the two approximation operators defined (see Def. 1), by relating terms across documents. This is useful particularly when each document is characterized by only a small number of terms along with many zero-valued entries in a high dimensional term vector space.

Let $D=\{d_{1},d_{2},\ldots,d_{M}\}$ be a set of documents and and let $T=\{t_{1},t_{2},\ldots,t_{N}\}$ be a universe of index terms that occur in those documents. Let each document $d_{j}$ be represented as a weight vector of its terms $\hat{d_{j}}=\langle(t_{1},w_{1j}),(t_{2},w_{2j}),\ldots,(t_{N},w_{Nj})\rangle$ where $w_{ij}\in[0,1]$ shows the significance of term $i$ in document $j$ . Let a query $Q$ be a cluster representative in the form of $\hat{Q}=\langle(q_{1},w_{1q}),(q_{2},w_{2q}),\ldots,(q_{s},w_{sq})\rangle$ where $q_{i}\in T$ and $w_{iq}\in[0,1]$ [26]. The goal is to find ordered documents $d_{j}\in D$ that are relevant to $Q$ .

(Document Approximation Space).

Let $\mathcal{A_{\textit{DOC}}}=(T,I,\nu,P)$ be the approximation space for approximating documents, where $I\subseteq T\times T$ is the tolerance relation and $\nu:\mathbb{P}(T)\times\mathbb{P}(T)\to[0,1]$ is the inclusion function. The tolerance relation $I$ is defined as:

$\displaystyle I_{\theta}(t_{i})=\{t_{j}|f_{D}(t_{i},t_{j})\geqslant\theta\}% \cup\{t_{i}\}$

where $\theta$ is a user-defined threshold and $f_{D}(t_{i},t_{j})$ denotes the number of terms in which $t_{i}$ and $t_{j}$ co-occur. The inclusion function is defined as:

$\displaystyle\nu(I_{\theta}(t_{i}),X)=\frac{|I_{\theta}(t_{i})\cap X|}{|I_{% \theta}(t_{i})|}$

where $X$ represents the concept (documents) to be approximated and the structurality function $P$ is 1 for $\mathcal{A_{\textit{DOC}}}$ . The lower and upper approximations of $X$ are redefined as follows:

$\displaystyle\mathcal{L}_{\mathcal{A_{\textit{DOC}}}}(X)=\left\{t_{i}\in T:% \frac{|I_{\theta}(t_{i})\cap X|}{|I_{\theta}(t_{i})|}=1\right\},$ $\displaystyle\mathcal{U}_{\mathcal{A_{\textit{DOC}}}}(X)=\left\{t_{i}\in T:% \frac{|I_{\theta}(t_{i})\cap X|}{|I_{\theta}(t_{i})|}>0\right\}.$

.

The uncertainty function defines the tolerance class for each index term and, in effect, captures affinity amongst the terms via the co-occurrence function $f_{D}$ .

.

The inclusion function also called vague inclusion function, measures the overlap between tolerance class (containing $t_{i}$ ) and set $X$ .

.

The inclusion function also called vague inclusion function, measures the overlap between tolerance class (containing $t_{i}$ ) and set $X$ .

Weight Adjustment via Tolerance Rough Sets:

The following weighting scheme takes into account boundary terms and assigns non-zero weights. Note that the upper approximation of a document $\mathcal{U}_{\mathcal{A_{\textit{DOC}}}}(d_{j})$ covers the tolerant terms for all of its own terms as well, creating the enriched representation.

$\displaystyle w_{ij}^{\mbox{\@setsize{\tiny}{7pt}{\vipt}{\@vipt}trs }}=\left\{\begin{array}[]{ll}(1+\log(f_{d_{j}}(t_{i})))\times\log\frac{M}{f_{D% }(t_{i})},&\mbox{if }t_{i}\in d_{j},\\ \min_{t_{h}\in d_{j}}w_{hj}\times\frac{\log(M/f_{D}(t_{i}))}{1+\log(M/f_{D}(t_% {i}))},&\mbox{if }t_{i}\in\mathcal{U}_{\mathcal{A_{\textit{DOC}}}}(d_{j})% \backslash d_{j},\\ 0,&\mbox{if }t_{i}\notin\mathcal{U}_{\mathcal{A}}(d_{j}).\end{array}\right.$

.

Once the weights are adjusted within the framework of tolerance rough sets, one can measure the similarity between a query vector by using the following formula which is typically used in retrieval (by clustering similar documents).

$\displaystyle\mbox{Similarity}(Q,d_{j})=\frac{2\times\sum_{k=1}^{N}(w_{kq}% \times w_{kj})}{\sum_{k=1}^{N}w^{2}_{kq}+\sum_{k=1}^{N}w^{2}_{kj}},$

2.1.2 Named entity representation with TRS

In contrast to document representation and learning, named entity identification consists of recognizing common linguistic entities (in unstructured text), and relationships between them as well as categorizing them into predefined categories such as names, location, organization. These entities are meaningful terms (nouns) or phrases (relations) in unstructured text. In this section, we give formal definitions for approximating entities (nouns). Definitions for relations can be found in [44]. Let $\mathcal{N}=\{n_{1},n_{2},\ldots,n_{M}\}$ represent the universe of nouns, and let $\mathcal{C}=\{c_{1},c_{2},\ldots,c_{P}\}$ represent the universe of contextual patterns. Let $C:\mathcal{N}\to\mathbb{P}(\mathcal{C})$ map each noun to its set of co-occurring contexts: $C(n_{i})=\{c_{j}:f_{\mathcal{N}}(n_{i},c_{j})>0\}$ where $f_{\mathcal{N}}(n_{i},c_{j})=\{\kappa\in\mathbb{N}:n_{i}$ occurs $\kappa$ times within context $c_{j}\}$ . Let $N:\mathcal{C}\to\mathbb{P}(\mathcal{N})$ map each context to its set of co-occurring nouns: $N(c_{j})=\{n_{i}:f_{\mathcal{N}}(n_{i},c_{j})>0\}$ .

(TRS Approximation Space for nouns [54]).

Let $\mathcal{A_{\textit{NER}}}=(\mathcal{C},\mathcal{N},I,\omega,\nu)$ where $\mathcal{N}$ and $\mathcal{C}$ are as defined previously. $I=I_{\theta}$ is the parametrized uncertainty function describing the tolerance classes for the contexts, in terms of contextual overlaps:

$\displaystyle I_{\theta}(c_{i})=\{c_{j}:\omega(N(c_{i}),N(c_{j}))\geqslant% \theta\}.$

Here, $\theta$ is the tolerance threshold and $\omega$ is the overlap index which is the Sorensen-Dice index [55]:

$\displaystyle\omega(A,B)=\frac{2|A\cap B|}{|A|+|B|}.$ (1)

The degree of inclusion is measured by $\nu:\mathbb{P}(\mathcal{C})\times\mathbb{P}(\mathcal{C})\to[0,1]$ and is defined similar to the the aproximation space for documents. Within the framework of $\mathcal{A_{\textit{NER}}}$ , the lower and upper approximation of a noun $n_{i}$ is defined as:

$\displaystyle\mathcal{L}_{\mathcal{A_{\textit{NER}}}}(n_{i})=\{c_{j}\in% \mathcal{C}:\nu(I_{\theta}(c_{j}),C(n_{i}))=1\},$

and the upper approximation is defined as:

$\displaystyle\mathcal{U}_{\mathcal{A_{\textit{NER}}}}(n_{i})=\{c_{j}\in% \mathcal{C}:\nu(I_{\theta}(c_{j}),C(n_{i}))>0\}.$

.

A noun (concept) can now be approximated using its contexts. The lower approximation represents closely related contexts while the upper approximation represents somewhat related contexts.

.

Common entities (nouns) from web corpora [54]

•

nouns: Ice Hockey, Canada, Microsoft

•

Categories: Sport, Country, Company

•

Contexts: “_ league”, “_ and other sports”, “_ is popular in _”

•

Co-occurrence statistics: e.g. $f$ (“Ice Hockey”, “_ league”) $=n$

.

Biomedical entities from biomedical literature [56]

•

entities: sars, saliva, immunohistochemistry

•

Categories: coronavirus, substrate, diagnostics

•

Contexts: “respiratory syndrome-associated_”, “binding of _”, “virus replication in _”

•

Co-occurrence statistics: e.g. $f$ (“coronavirus”, “respiratory syndrome-associated_”) $=145$

2.2 Named entity representation with FRS

A fuzzy rough set is a pair $(A_{1},A_{2})\in(X,R)$ where A is a fuzzy set in X such that $R\downarrow A=A_{1}$ and $R\uparrow A=A_{2}$ . $R$ is a fuzzy relation in $X$ [39] where $\downarrow$ represents the lower approximator and $\uparrow$ represents the upper approximator.

A knowledge base can be viewed as a thesaurus consisting of linguistic entities such as nouns, relations, and their contexts (or contextual patterns). We now define give formal definitions for relation entities. Let $\mathcal{H}=\{h_{ij}=(n_{i},n_{j})\in{\mathcal{N}^{2}}:\exists{r_{k}}\in% \mathcal{R}|f_{\mathcal{H}}(h_{ij},r_{k})>0\}$ represent the universe of relations described via the relational co-occurrence function $f_{\mathcal{H}}(h_{ij},r_{k})$ where $f_{\mathcal{H}}(h_{ij},r_{k})=\{k\in\mathbb{N}:h_{ij}$ occurs $k$ times within the context $r_{k}\}$ and $n_{i}$ is a noun and $r_{k}$ is a relational context. Let $R:\mathcal{H}\to\mathbb{P}(\mathcal{R})$ represent the mapping of each relation to its set of co-occurring relational contexts: $R(h_{ij})=\{r_{k}:f_{R}(h_{ij},r_{k})>0\}$ .

(Trusted relations).

Let $\mathcal{TR}=\{h_{ab},h_{bc},\ldots,h_{cd}\}$ represent a set of trusted relations such that $\mathcal{TR}\subset\mathcal{H}$ and index ${cd}<{ij}$ .

(Co-occurrence).

Let $R:\mathcal{H}\to\mathbb{P}(\mathcal{R})$ denote mapping of each relation to its set of co-occurring relational contexts: $R(h_{ij})=\{r_{k}:f_{R}(h_{ij},r_{k})>0\}$

.

Common entities (relations) from web corpora [45]

•
relations: Tiger Woods $\|$ United States, Winnipeg $\|$ Canada, Satya Nadella $\|$ Microsoft
•
Categories: City_Country, Athlete_Team, CEO_Company
•
Contexts: “attacking midfielder”, “_prodigy”, “_ have confirmed_”

.

The cross mapping function forms the basis for the co-occurrence matrix. Function $f_{R}(h_{ij},r_{k})$ is used in the thesaurus construction process.

(FRS Approximation Space for relations [54]).

Let $A_{\textit{FRS}}=(\mathcal{H},\mathcal{R},CO_{F})$ represent the approximation space used for thesaurus contruction. The fuzzy relation $CO_{F}$ is defined as a fuzzy set in $\mathcal{H}\times\mathcal{R}$ . Let $\mathcal{H_{F}},\mathcal{TR_{F}}$ represent the fuzzy sets of relations and trusted relations respectively. The upper and lower approximations of the fuzzy set $\mathcal{H_{F}}$ in $A_{\textit{FRS}}$ is denoted by $\mathcal{H_{F}}\uparrow CO_{F}$ and $\mathcal{H_{F}}\downarrow CO_{F}$ [22, 57, 58].

.

The first step towards creation of a fuzzy thesaurus is by normalizing the co-occurrence information using $\vartheta$ . The next step involves fuzzifying the co-occurrence function with the S-function where $\alpha$ and $\beta$ are constants determined experimentally.

(Fuzzifying the co-occurrence function [22, 59]).

$\displaystyle\vartheta(h_{ij},r_{k})=\frac{{f_{R}(h_{ij},r_{k})}}{{f_{R}(h_{ij% },r_{k}),\forall k:1..Q}}$ (2) $\displaystyle S(\vartheta;\alpha,\beta)=\left\{\begin{array}[]{ll}1&\text{if }% \vartheta\geqslant\beta\\ \frac{\vartheta-\alpha}{\beta-\alpha}&\text{if }0.005\leqslant\vartheta% \textless\beta\\ 0,&\text{otherwise}\\ \end{array}\right.$ (3)

.

Fuzzifying co-occurrence $CO_{F}$

Table 1
Normalized frequency $\vartheta$ for relation FC Barcelona $\|$ Lionel Messi

FC Barcelona $\|$ Lionel Messi Normalized frequency $\vartheta$

Attacking midfielder 1/7

Forward 1/7

Have handed 2/7

Player 2/7

Prodigy 1/7

Table 2
S-function value for relation FC Barcelona $\|$ Lionel Messi

FC Barcelona $\|$ Lionel Messi S-function

Attacking midfielder 1

Forward 1

Have handed 1

Player 1

Prodigy 1

(Upper approximation).

$\displaystyle\mathcal{H_{F}}\uparrow CO_{F}=\mathop{\mbox{sup}}\limits_{{h_{ij% }\in\mathcal{H}},{h_{xy}\in\mathcal{TR}}}(CO_{F}(R(h_{ij}),h_{xy}),\mathcal{H_% {F}}(h_{xy}):CO_{\mathcal{F}}({h_{ij}})\geqslant CO_{\mathcal{F}}({h_{xy}}))$ (4)

.

The upper approximation makes it possible to select candidate relations $h_{ij}$ having a membership co-occurrence value either equal to or more than that of the trusted relations $h_{xy}$ .

(Lower approximation).

$\displaystyle\mathcal{H_{F}}\downarrow CO_{F}=\mathop{\mbox{inf}}\limits_{h_{% ij}\in\mathcal{H},\ {h_{xy}\in\mathcal{TR}}}(CO_{F}(R(h_{ij}),h_{xy}),\mathcal% {H_{F}}(h_{xy})$ (5) $\displaystyle:((h_{ij},h_{xy})|R({h_{xy}})\cap R({h_{ij}})\neq\emptyset))$

.

The lower approximation will makes it possible to select candidate relations $h_{ij}$ as trusted when there is at least one common context with a trusted noun $h_{xy}$ .

.

To counter the problem where candidate relations with high membership degrees are promoted and added to the set of trusted relations, irrespective of whether the candidate relation is related to the trusted relation, the upper approximation operation of all candidate relations $h_{ij}$ followed by lower approximation for the remaining candidate relations is performed. This operation is called tight upper approximation and is defined in Eq. (6).

(Tight Upper approximation [22]).

$\displaystyle CO_{F}\downarrow\uparrow{\mathcal{H}_{F}}(h_{ij})=CO_{F}% \downarrow(CO_{F}\uparrow{\mathcal{H}_{F}}(h_{ij}))$ (6)

.

The micro score for a candidate relation $h_{ij}$ (see Eq. (7)) is calculated using the tight approximation operator. This will score will be used during the learning process.

(micro score).

$\displaystyle\textit{micro}(h_{ij})=\omega_{1}(\mathcal{H_{F}}\uparrow CO_{F})% +\omega_{2}(\mathcal{H_{F}}\downarrow CO_{F}))$ (7)

where $\omega_{1}$ and $\omega_{2}$ are application dependent.
2.2.1 Tolerance near sets in sentiment classification

FC Barcelona $\\|$ Lionel Messi	Normalized frequency $\vartheta$
Attacking midfielder	1/7
Forward	1/7
Have handed	2/7
Player	2/7
Prodigy	1/7

FC Barcelona $\\|$ Lionel Messi	S-function
Attacking midfielder	1
Forward	1
Have handed	1
Player	1
Prodigy	1

Near sets are essentially disjoint sets that contain objects with similar descriptions provided the intersection of the sets is nonempty. The basic structure which underlies near set theory is a perceptual system which consists of perceptual objects [10]. Near sets are characterized by a perceptual system, a nearness relation and a near set [60].

.

Perceptual System [10]

A perceptual system is a pair $\langle O,F\rangle$ , where $O$ is a nonempty set of perceptual objects and $F$ is a countable set of real-valued probe functions $\phi_{i}:O\to\mathbb{R}$ where $\phi_{i}:O\rightarrow\mathbb{R}$ is a probe function of a single feature.

The probe functions give rise to a number of perceptual relations. The notion of tolerance is directly related to the idea of closeness between objects, that resemble each other with a tolerable level of difference. The tolerance space for a text-based nearness relation consists of textual objects where the universe consisting of text $T$ described by set of vectors $\Phi$ .

.

Text-based Tolerance Relation $\cong_{\mathcal{T},\epsilon}$ [61]

Let $\langle T,F\rangle$ be a universe of nonempty set of objects $T$ and $F$ be the feature set. Let $\mathcal{T}\subseteq F$ where $\mathcal{T}$ represents textual features. A tolerance space $\langle T,\cong_{\mathcal{T},\epsilon}\rangle$ is defined as:

$\displaystyle\cong_{\mathcal{T},\epsilon}=\{(t_{i},t_{j})\in T\times T:\textit% {dist}(t_{i},t_{j})\leqslant\varepsilon\}$ (8)

where dist is the cosine distance given in Eq. (8). The tolerance relation $\cong_{\mathcal{T},\epsilon}$ induces a tolerance class $T C$ where $\varepsilon$ is a user-defined tolerance level.

$\displaystyle\textit{dist}(t_{i},t_{j})=1-\frac{{\phi(t_{i}).\phi(t_{j})}}{{% \left\|{\phi(t_{i})}\right\|\left\|{\phi(t_{j})}\right\|}}$ (9)

.

In other words, given a set of text (e.g., tweets) $T$ , where $t_{i}\in T$ , $i\in N$ , each tweet or text $t_{i}$ can be represented as a k-dimensional word vector $\phi(t_{i})$ where text similarity is measured using the cosine distance measure. The universe of text $T$ described by set of vectors $\phi$ , is spread amongst tolerance classes with a tolerance level $\varepsilon$ for semantic textual similarity, where a tolerance class $T C$ is maximal with respect to inclusion [62].

.

Let $T=\{t_{0},t_{1},t_{2},t_{3},t_{4}\}$ is set of tweets elaborated as: $T=$ {“So there is no way for me to plug it in here in the US unless I go by a converter.”, “Good case, Excellent value.”, “Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!”, “He was very impressed when going from the original battery to the extended battery.”, “The design is very odd, as the ear clip is not very comfortable at all.”}

.

Let $\Phi=$ {[7.22507477e-01, 1.02917385e+00, …, $-$ 6.84888303e-01, 2.86179781e-01], [ $-$ 2.96921253e-01, 7.52622932e-02, …, $-$ 4.93599415e-01, $-$ 1.82238102e-01], [1.55841690e-02, 7.04942644e-02, …, 3.76997262e-01, $-$ 2.59568900e-01], [7.95753375e-02, $-$ 2.57343829e-01, …, 6.54604957e-02, $-$ 2.85186619e-01 ], [ 4.41945344e-01, 5.39983749e-01, …, $-$ 4.36222434e-01, 1.82313472e-01]} where $\Phi$ is a set of 768 dimensional vectors extracted using SBERT.

.

Figure 2 shows the cosine distance matrix created by using Eq. (8).

Figure 2.

Distance matrix with cosine distance values between tweets $t_{0},t_{1},t_{2},t_{3},t_{4}$ .

.

Based on the distance matrix shown in Fig. 2, and $\varepsilon\leqslant$ 0.79, we can derive the following following tolerance classes:

$\textit{dist}(t_{0},t_{4})=0.56$ $\textit{dist}(t_{4},t_{2})=0.65$ $\textit{dist}(t_{0},t_{2})=0.77$ since all of the pairs satisfy $\varepsilon\leqslant$ 0.79 which creates the first tolerance class $TC_{0}$ . $TC_{0}=\{t_{0},t_{2},t_{4}\}$ . $\textit{dist}(t_{1},t_{3})=0.32$ $TC_{1}=\{t_{1},t_{3}\}$ which creates the second tolerance class. $\textit{dist}(t_{2},t_{4})=0.65$ $\textit{dist}(t_{2},t_{3})=0.74$ $\textit{dist}(t_{4},t_{3})=0.786$ $TC_{2}=\{t_{2},t_{3},t_{4}\}$ which creates the third tolerance class. The final tolerance class is union of all above classes $TC_{0},TC_{1},TC_{2}$ . $TC=\{\{t_{0},t_{2},t_{4}\},\{t_{1},t_{3}\},\{t_{2},t_{3},t_{4}\}\}$

3. Method I – Named entity recognition application

3.1 Semi-supervised learning with TRS and FRS

In this section, we give an overview of the TRS-based pattern learner (TPL) algorithm for categorizing nouns with the representation described in sec. 2.1.2. TPL follows an iterative method and categorizes nouns using a semi-supervised approach. In this approach, given a trusted noun $n_{i}$ for a category and a candidate noun $n_{j}$ , three approximations are computed: $C(n_{i})$ , $\mathcal{L}_{\mathcal{A_{\textit{NER}}}}(n_{i})$ , $\mathcal{L}_{\mathcal{A_{\textit{NER}}}}(n_{i})$ shown in blue, white and orange regions respectively in Fig. 3. Each region represents a degree of similarity for a candidate noun $n_{j}$ which will be used to determine an overall score for the promotion of $n_{j}$ .

Figure 3.

Zones of Approximation induced by trusted entity (noun ( $n_{i}$ )) and candidate entity (noun ( $n_{j}$ )) and contexts $C(n_{i})$ and $C(n_{j})$ respectively for a certain category.

.

In other words, a trusted noun $n_{i}$ has its universe of contexts partitioned by its three descriptors (context, upper and lower). For example, the degree of similarity is highest in zone 1 (orange) since it is covered by lower approximation of the trusted noun and the context of the candidate noun. These zones will be used in the promotion scoring defined below.

(Promotion scoring).

$\displaystyle\textit{micro}(n_{i},n_{j})=\omega(C(n_{i}),C(n_{j}))\alpha+% \omega(\mathcal{U}_{\mathcal{A_{\textit{NER}}}}(n_{i}),C(n_{j}))\beta+\omega(% \mathcal{L}_{\mathcal{A_{\textit{NER}}}}(n_{i}),C(n_{j}))\gamma.$ $\displaystyle\textit{macro}_{\textit{cat}}(n_{j})=\sum\limits_{\forall n_{i}% \in\textit{Trusted}_{cat}}^{n}\textit{micro}(n_{i},n_{j}).$

.

A micro score is calculated for each category. $\alpha$ , $\beta$ and $\gamma$ are weights that reflect the contribution of each zone and are determined by the application domain. A macro score is calculated over all categories. The overlap index function $\omega$ given in Eq. (1) is used for this calculation.

[h!] TPL for categorizing nouns [54]

InputInput OutputOutput

A large unlabelled corpus $U$ , trusted nouns, categories and the co-occurrence matrix. Promoted nouns for each category. These are labelled nouns. $r=1\rightarrow\infty$ each category cateach new trusted noun $n_{i}$ of cat Calculate the approximations $\mathcal{L}_{\mathcal{A_{\textit{NER}}}}(n_{i})$ and $\mathcal{U}_{\mathcal{A_{\textit{NER}}}}(n_{i})$ each candidate noun phrase $n_{j}$ Calculate $\textit{micro}(n_{i},n_{j})$ each candidate noun $n_{j}$ $\textit{macro}_{\textit{cat}}(n_{j})=\sum\limits_{\forall n_{i}\in\textit{cat}% }\textit{micro}(n_{i},n_{j})$ Rank nouns by $\textit{macro}_{\textit{cat}}/|\textit{cat}|$ Promote top nouns as trusted

The input to the TPL algorithm (Alg. 13) is an ontology comprised of a set of categories (e.g. City) and a handful of trusted nouns also called as seeds (e.g., Winnipeg, New Delhi, Ankara). Additionally, a large co-occurrence matrix representing the nouns and the contextual patterns are input to the TPL algorithm. The output consists of promoted trusted nouns assigned to their respective categories adding to the growing knowledge base of trusted (labelled) nouns. Lines 6 and 8 show the computation of the scores. Once the score is calculated for every candidate noun of cat, the candidates are ranked by their macro-scores (normalized by the number of trusted instances of cat) shown in line 8. The top new candidates are promoted as trusted nouns shown in line 9. Two versions of the TPL were designed. TPL 1.0 was the first implementation of TRS in a NER application to categorize nouns and relations. The motivation for the development of TPL 2.0 was to explore the scalability (with a larger ontology) of tolerant pattern learner and to test whether any external constraints were necessary to overcome concept drift.

The FRL algorithm is also a semi-supervised algorithm which several features common with TPL in terms of the process of learning and promotion. In contrast to the TPL algorithm, Alg. 13 gives the categorization of relations (pair of nouns) with the representation described in Sec. 2.2. The input to FRL is an ontology of trusted relations (seeds), categories, unlabeled relational facts, and a co-occurrence matrix. The process consists of building a fuzzy thesaurus (line 5), applying the upper and lower approximator operators (resulting in tight upper approximation) are shown in steps 6–10.

[h!] FRL for categorizing relations [59]

InputInput OutputOutput

An ontology defining categories; a large corpus $\mathcal{H}$ , $C O$ co-occurrence matrix, a small set of trusted relations $\mathcal{TR}$ $\mathcal{TR^{\prime}}$ is a set of all new promoted trusted relations $r=1\rightarrow\text{end of file}$ each category cateach new trusted relations $h_{xy}$ belonging to cateach candidate relation $h_{ij}$ Calculate Fuzzy Relation $\mathcal{CO_{F}}$ using Eqs (2) and (3) Calculate Upper Approximation $\mathcal{H_{F}}\uparrow CO_{F}$ as in Eq. (4) Calculate score $\omega_{1}$ each candidate relation $h_{ij}$ Calculate Lower Approximation $\mathcal{H_{F}}\downarrow CO_{F}$ using Eq. (9) Calculate score $\omega_{2}$ Calculate $\textit{micro}_{\textit{cat}}(h_{ij})$ using Eq. (7) Sort trusted instances $h_{xy}$ by $\textit{micro}_{\textit{cat}}/|\textit{cat}|$ Promote top trusted instances, such that $\mathcal{TR^{\prime}}=\mathcal{TR}\cup\{h_{xy}\}$

4. Method II – Sentiment polarity classification

In this section, we present a near set-based algorithm (TSC) to classify sentiments of tweets and news categories from text. The TSC classification algorithm has two parts: training and testing. Alg. 4 describes the training process where a tolerance class representative vector is generated.

[!h] TSC Training Phase: Generating tolerance class representative vectors [61]InputInput OutputOutput $TV=\{TV_{1},\dots,TV_{M}\}$ , // Transformer Vectors for training $\varepsilon>0,$ // Tolerance level parameter $(NT,\{(R_{1},\textit{TextCat}_{1}),\dots,(R_{NT},\textit{TextCat}_{NT})\})$ NT is the size of the Tolerance class set

// Create a distance matrix for all training vectors based on the function defined in Eq. (9)

$p\leftarrow 1$ $M$ $q\leftarrow p+1$ $M$ $\mbox{computeCosineDist}(TV_{p},TV_{q},{\textit{Cos}}_{pq})$ //a. Create object pairs satisfying the tolerance relation defined in Eq. (8) b. Create a neigbhourhood of training vectors. c. GenerateToleranceClass Set $T$ where $TC_{i}$ is one tolerance class member induced by the tolerance relationd. Determine the majority polarity element in each $T_{i}$ e. Compute representative class vectors

$i\leftarrow 1$ $M$ $j\leftarrow i+1$ $M$ $\mbox{ObjectPairs}\leftarrow\mbox{generatetolerantpairs}(\mbox{Cos}_{ij},\varepsilon)$ ; $N_{i}\leftarrow\mbox{createobjectneighbour}(\mbox{ObjectPairs},i,TV)$ ; //Compute the neighbourhood $N_{i}$ of $i^{\text{th}}$ training vector $TV_{i}$

all, $o_{1},o_{2}\in N_{i}$ $o_{1},o_{2}\in\mbox{ObjectPairs}$ $TC_{i}\leftarrow\{o_{2}\}$ ; // Include $o_{2}$ from $N_{i}$ into $TC_{i}$ $T\leftarrow T\cup\{TC_{i}\}$ ; //T is set of all tolerance classes $\textit{TextCat}_{i}\leftarrow\mbox{computeMajorityPol}(T_{i})$ ; //For each $T_{i}$ , determine polarity/category by majority voting

$NT\leftarrow|T|$ ; //Number of tolerance classes in T //End of the process of generating set $T$ which is set of all tolerance classes.//For each Tolerance class in $T$ , generate a tolerance representative class vector and its polarity/category. $\{(R_{1},\textit{TextCat}_{1}),\dots,(R_{NT},\textit{TextCat}_{NT})\}% \leftarrow\mbox{GenerateClassRepresentative}(NT)$ ;

Given a tolerance level $\varepsilon$ , tolerance classes are induced from the training set vectors using the cosine distance, and the representative of each tolerance class is computed as the mean value of the feature vector. The polarity (or category) of the representative vector is determined based on majority voting. In the classification phase shown in Alg. 4, TSC uses the representative tolerance class vectors generated in the training phase and their associated polarity/text category. The computeCosineDist function (line 3) calculates the cosine distance between each test set vector and all the representative tolerance class vectors. The DeterminePolarity function (line 4) chooses the representative tolerance class that is closest to the test set vector and assigns the polarity of the representative to the test set vector.

[h!] TSC Classification Phase: Assigning Sentiment Polarities [61]InputInput OutputOutput $\varepsilon>0$ , //Tolerance level parameter, $N T$ //Size of the Tolerance class set $T$ , $TV^{\prime}=\{TV^{\prime}_{1},\dots,TV^{\prime}_{M}\}$ , //Transformer Vectors for testing

$\{(R_{1},\textit{TextCat}_{1}),\dots,(R_{NT},\textit{TextCat}_{NT})\}$ //Representative class vectors generated in the training phase and their associated polarities

$(TV^{\prime}=\{(TV^{\prime}_{1},\textit{TextCat}_{1}),\dots,(TV^{\prime}_{M},% \textit{TextCat}_{M})\})$ //Transformer Vectors with assigned polarities/categories

$i\leftarrow 1$ $M$ $j\leftarrow i+1$ $N T$ $\mbox{computeCosineDist}(TV^{\prime}_{i},R_{j},\mbox{Cos}_{ij})$ ; $TV^{\prime}\leftarrow\mbox{DeterminePolarity}(\textit{Cos}_{ij})$ //Computes min. distance and assigns polarity to the test set vector

5. Materials

5.1 NER dataset for TPL 1.0. TPL 2.0 and FRL experiments

The original source for TPL 1.0 and FRL experiments is ClueWeb092

²
https://lemurproject.org/clueweb09/.

which is a large web document collection consisting of roughly 1 billion web pages in ten languages, collected in early 2009. NLP methods were used to process the web pages from CluWeb09 consisting of millions of raw noun phrases, contextual patterns and all the co-occurrence information. The authors [48] shared the all-pairs dataset from which 68,919 unique nouns and 59,325 unique contextual patterns were extracted for TPL 1.0 experiments. For the TPL 1.0 experiments with relations, 13,020,010 unique noun phrase pairs and 11,424,413 unique contextual patterns were used. For TPL 2.0 for categorizing nouns, more than 900 million sentences from ClueWeb123

https://lemurproject.org/clueweb12/specs.php.

were extracted by us to derive an enlarged set of 130,536 nouns and 118,648 contextual patterns. Preprocessing details of the ClueWeb12 dataset for TPL 2.0 experiments can be found in [46]. ClueWeb12 is a collection of about 733,019,372 English web pages (approximate ly 6TB of compressed data) in Web Archive format (WARC).4

⁴

https://www.iso.org/standard/44717.html.

5.2 Text dataset

A subset of ten curated benchmark datasets (shown in Table 3) consisting of a mix of long and short words with varying number and sizes of sentiment classes (positive, negative, neutral and irrelevant) were used in the TSC experiments. The Covid-Sentiment is a manually labelled data which is a subset derived from [63] using Tweets ID for $1^{\text{st}}$ April 2020 and $1^{\text{st}}$ May 2020. 47,386 tweets were extracted with the help of Twitter API. The tweets in languages other than English (ex: French, Hindi, Mandarin, Portuguese) were removed. Extensive pre-processing of 29,981 English language tweets from the original dataset such as removal of HTML tags, @Username, Hashtags, URLs, and incorrect spellings was performed. A total of 8,003 hand labelled tweets were prepared for experimentation. The 20 Newsgroups dataset introduced in [64], contains approximately 20,000 newsgroup posts and this dataset is partitioned (nearly) evenly across 20 different newsgroups organized into broader categories: computers, recreation, religion, science, sale, and politics. Details of the curated datasets used in the TSC experiments can be found in [50].

Table 3
Dataset information

Dataset	Type	Size	Positive	Negative	Neutral	Irrelevant
Covid-Sentiment	Train	7000	22.02%	30.35%	47.63%	–
	Test	1003	23.53%	37.29%	39.18%	–
U.S. Airline Sentiment [65]	Train	12000	16.79%	61.02%	22.19%	–
	Test	1000	13%	67.5%	19.5%	–
IMDB Movie Review [66]	Train	20000	50.27%	49.73%	–	–
	Test	2000	50.35%	49.65%	–	–
SST-2 [67]	Train	15000	55.37%	44.63%	–	–
	Test	1500	55.53%	44.47%	–	–
Sentiment140*	Train	15000	50%	50%	–	–
	Test	1000	50%	50%	–	–
SemEval 2017 [68]	Train	17001	40.67%	15%	44.33%	–
	Test	3546	41.54%	15.76%	42.70%	–
Sanders corpus $\dagger$	Train	4059	10.24%	11.38%	45.26%	33.12%
	Test	1015	9.85%	10.54%	47.68%	31.93%
UCI Sentence [69]	Train	2700	49.11%	50.89	–	–
	Test	300	58%	42%	–	–
Dataset	Type	Size	World	Sports	Business	Science
AG-News [70]	Train	12000	3000	3000	3000	3000
	Test	1150	300	250	300	300

${}^{*}$ https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf. ${}^{\dagger}$ http://www.sananalytics.com/lab/twitter-sentiment/.

6. Results and discussion

Precision@30 metric was used for evaluating the TPL and FRL classifier performance. In each iteration, Precision@N is calculated as follows:

$\displaystyle{\rm{Precision@N}}=\frac{{{\text{number of correct examples}}}}{{% {\text{N top-ranked examples}}}}$ (10)

As the dataset is not labeled, the correctness of an instance was judged manually.

6.1 Categorizing nouns

11 categories were used in the experiments: Company, Disease, KitchenItem, Person, PhysicsTerm, Plant, Profession, Sociopolitics, Website, Vegetable, Sport. Each category was initialized by 5–8 seed nouns (trusted) for 10 iterations. For every category, the top 5 new nouns promoted as “trusted” per iteration. The tolerance threshold was set to $\theta=$ 50% and the weight values were set to: $\alpha=$ 0.5, $\beta=$ 0.25 and $\gamma=$ 0.25.

Table 4
TPL 2.0 and TPL 1.0 [71] and FRL [59] and CBS [48] results with Precision@30 per category

Categories	Iteration 5				Iteration 10
	TPL 2.0	TPL 1.0	CBS	FRL	TPL 2.0	TPL 1.0	CBS	FRL
Company	100	100	100	100	100	100	100	100
Disease	100	100	100	100	100	100	100	100
KitchenItem	97	100	94	97	97	100	94	73
Person	100	100	100	100	100	100	100	100
PhysicsTerm	97	93	100	67	97	90	100	77
Plant	94	100	100	77	97	97	100	100
Profession	100	100	100	100	100	100	87	100
Sociopolitics	94	100	47	93	97	100	34	87
Sport	100	97	97	100	100	100	100	100
Website	97	90	94	97	97	90	90	93
Vegetable	74	93	83	83	90	63	48	47
Average	95.7	97.5	92	92	97.7	94.5	87	89

6.2 Categorizing relations

The following 11 categories were used: Athlete_Team, CEO_Company, City_Country, City_State, Coach_Team, Company_City, Stadium _City, State_Capital, State_Country, Team_vs_Team. 6 seed instances were used for 10 iterations. In every iteration, the top 5 new relations in every category were promoted as trusted relations for subsequent iterations. All the parameter values for $\theta,\alpha,\beta,$ and $\gamma$ were the same as the ones used for categorizing nouns. To evaluate promotion-based results, the same steps implemented by CPL [47] were applied. 30 pairs were sampled from all the promoted pairs and Precision@30 values was calculated.

Table 5
Precision@30 of TPL 1.0, CPL and FRL for promotion-based method

Categories	TPL			FRL			CPL
	1	5	10	1	5	10	10
Athlete_Team	100	96	87	100	100	83	100
CEO_Company	100	100	100	100	100	100	100
City_Country	100	100	100	100	93	96	93
City_State	100	100	100	100	100	100	100
Coach_Team	100	100	93	100	100	100	100
Company_City	40	84	97	100	100	100	50
Stadium _City	80	92	70	80	92	90	100
State_Capital	100	100	63	100	88	43	60
State_Country	100	100	100	100	100	100	97
Team_vs_Team	100	84	80	100	96	100	100
Average	92.0%	95.6%	89.0%	98.0%	96.6%	91.2%	90.0%

Table 5 gives the promotion-based results for all three algorithms. Table 6 gives the ranking-based results. In both ranking and promotion-based methods, correctness results were verified manually. Bold-faced values indicate average values for all categories at the end of iteration 10. Based on the results in Table 6, the average precision after 10 iterations for the proposed FRL algorithm is significantly better than TPL using a ranking-based method. For this data set, one can observe that FRL is able to handle concept drift better than TPL in all categories. It is important to note that FRL enforces mutual exclusion constraint, whereas TPL was able to maintain high precision with no externally defined constraints. From Table 5, it can be seen that FRL performs better than TPL and CPL. It is noteworthy that CPL enforces 3 forms of constraints during the learning process [47]. Here FRL was able to do better than TPL in terms of concept drift only in two categories, Athlete_Team and Team_vs_Team. With CPL, the only result that was available was after the tenth iteration.

Table 6

Precision@30 for relations TPL 1.0 and FRL for ranking-based method

Categories	TPL			FRL
	Iter. 1	Iter. 5	Iter.10	Iter. 1	Iter. 5	Iter. 10
Athlete_Team	100	90	87	97	100	97
CEO_Company	100	100	100	100	100	100
City_Country	100	100	100	93	100	100
City_State	100	100	100	97	100	100
Coach_Team	93	93	93	100	100	100
Company_City	83	90	93	97	100	100
Stadium _City	97	93	80	93	70	93
State_Capital	100	97	73	93	83	76
State_Country	100	100	100	90	100	100
Team_vs_Team	93	83	80	100	100	100
Average	96.6%	94.6%	90.6%	96%	95.3%	96.7%

Table 7

TPL’s top-16 ranked instances for selected categories. Incorrect instances are boldfaced [44]

PhysicsTerms	Socipolitics	Vegetables
Iteration 1
Inertia	Socialism	Zucchini
Acceleration	Democracy	Spinach
Gravity	Dictatorship	Cucumber
Buoyancy	Monarchy	Tomato
Velocity	Independence	Broccoli
Momentum	Justice	Lettuce
Magnetism	Equality	Celery
Resonance	Pluralism	Cabbage
Curvature	Internationalism	Kale
Electromagnet.	Federalism	Cauliflower
Density	Secularism	Asparagus
Elasticity	Liberalism	Carrots
Surface Tension	Hegemony	Tomatoes
Polarization	Self-determ.	Avocado
Vibration	Unification	Eggplant
Entropy	Capitalism	Carrot
Iteration 10
Density	Humanism	Zucchini
Conductivity	Pluralism	Cabbage
Intensity	Federalism	Kale
Viscosity	Internationalism	Celery
Permeability	Nationalism	Cauliflower
Velocity	Rationality	Eggplant
Brightness	Liberalism	Carrots
Attenuation	Secularism	Asparagus
Luminosity	Individualism	Tomatoes
Reflectance	Democracy	Spinach
Sensitivity	Environmentalism	Squash
Amplitude	Morality	Cucumber
Thickness	Pragmatism	Melon
Frequency	Spirituality	Chicken
Water cont.	Regionalism	Tofu
Salinity	Subjectivity	Shrimp

6.3 Concept drift

One of the issues with semi-supervised learning is the problem of concept drift where incorrect categorizing of entities results over a period of time due a limited number of trusted (seed) labeled examples and a large number of unlabelled examples. This problem is mitigated by constraining the learning process (e.g., simultaneously learning independent classifiers, specifying mutual exclusion or other constraints). Table 7 is a illustration of incorrect promotions during the iterative learning process.

The motivation for designing TPL 2.0 was to explore its performance two important issues: scalability (larger noun and contextual patterns) and concept drift (observed in terms of increased number of iterations). The precision value at the end of the twentieth iteration for TPL 2.0 was 96.2%. The choice of a tolerance rough set-based learner was motivated by the fact that TPL 1.0 did not require definition of any external constraints to constrain the learning process when compared with the three bench-marked algorithms: CPL, CBS and FRL.

Table 8
SBERT vector-based weighted F1-score (rounded) results for six classifiers. Best results are in bold-face

Dataset	TSC-mean	RF	ME	SVM	SGD	LGBM
Covid-Sentiment	55	44	57	57	57	56
U.S. Airline	77	77	77	77	75	77
IMDB	76	69	73	73	72	72
SST-2	85	85	85	86	85	85
Sentiment140	70	68	72	72	66	70
SemEval 2017	60	54	64	63	63	60
Sanders corpus	69	70	76	74	76	75
UCI Sentence	89	84	86	87	87	83
AG-News	82	79	88	81	88	83
20-Newsgroups	66	41	58	52	52	53

Table 9

SBERT vector-based AUC-ROC score for six classifiers. Best results are in bold-face

Dataset	TSC-mean	RF	ME	SVM	SGD	LGBM
Covid-Sentiment	0.65	0.71	0.68	0.70	0.70	0.67
U.S. Airline	0.75	0.77	0.76	0.77	0.77	0.79
IMDB	0.76	0.83	0.83	0.86	0.88	0.87
SST-2	0.85	0.82	0.81	0.82	0.82	0.73
Sentiment140	0.70	0.71	0.74	0.72	0.73	0.73
SemEval 2017	0.66	0.71	0.71	0.71	0.70	0.72
Sanders corpus	0.77	0.77	0.78	0.79	0.80	0.76
UCI Sentence	0.90	0.72	0.78	0.77	0.76	0.67
AG-News	0.87	0.86	0.90	0.90	0.90	0.88
20-Newsgroups	0.82	0.81	0.86	0.86	0.87	0.82

6.4 TSC results

Table 8 gives weighted F1-score with the following benchmark algorithms: Random Forest (RF) [72], Maximum Entropy (ME), Support Vector Machine (SVM) [73], Stochastic Gradient Decent (SGD) [74] and Light Gradient Boosting Machine (LGBM) [75] classifier implementations in Scikit-learn.7

⁷
https://scikit-learn.org/stable/.

The range of tolerance values used by TSC for different datasets varied from 0.08 to 0.38. The best performance was observed in the U.S. Airline, IMDB, UCI SST-2 and 20-Newsgroup datasets and comparable with the COVID-Sentiment, SST-2 and Sentiment 140 datasets. Our experiments leads us to conclude that balanced tolerance classes with two sentiment classes (e.g., SST-2 and UCI sentence datasets) generated by the SBERT vectors results in good semantic similarity between vectors. The weighted F1-score with these two datasets is over 85%. Table 9 gives the AUC-ROC score for all the tested classifiers. A pair-wise comparison of all combinations (ovo) with the weighted average score parameter was used for obtaining the results. Based on the AUC-ROC score, the TSC algorithm performs best in the UCI and SST-2 datasets that have approximately similar number of tolerance classes and two classes. It is also noteworthy that the UCI and AGnews datasets have the best separability (90%) and IMDB (88%), 20-Newsgroups (87%) and SST-2 (85%) have scores above 84% based on the tested classifiers. Since SGD is a linear SVM, for most datasets, the weighted F1 scores are either similar or identical. Results with AUC-PRC score and a more detailed reporting of results can be found in [61]. Overall, it was observed that TSC performs poorly with highly imbalanced datasets having more than two sentiment classes.

7. Conclusion and future work

In this paper, we review novel tolerance-based rough and fuzzy-rough hybrid models that were developed specifically for representing linguistic entities as well as semi-supervised machine learning algorithms for categorizing linguistic relational facts from web corpora. We also present a recently developed tolerance near-set based supervised machine learning algorithm for non-topic classification (e.g., sentiments and text). Performance of the presented algorithms was discussed in terms of bench marked datasets, and algorithms. With the tolerance rough set model and semi-supervised learning, the practical challenges when working with large NER datasets can be summarized as: i) determination of an optimal tolerance value $\varepsilon$ , and the weights $\alpha$ , $\beta$ and $\gamma$ , ii) efficient construction and processing of the crisp co-occurrence matrix. With the fuzzy rough set model, the challenges are: i) to determine the optimal parameters $\alpha$ and $\beta$ for the S-function to construct a graded co-occurrence matrix, ii) to define constraints to minimize concept drift.

Future work with NER applications and tolerance-based granular methods will involve categorizing fine-grained entity types. For example, in BioNER (biomedical NER), with severe acute respiratory syndrome virus, other fine-grained types such as sars, mers, covid-19 in addition to categorizing common biomedical entity types such as genes, chemicals and diseases is important. With tolerance near set-based classification learning, experiments with other word embeddings methods (e.g., Word2Vec), as well as examining the impact of the bag-of-words representation dimensionality will be part of work in future. Experiments with other pre-trained transformer-based vectors (besides SBERT) would be considered. Parallel computing with CUDA8

⁸

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.

for co-occurrence and distance matrices are a natural extension for optimizing the performance of the granular tolerance-based algorithms.

Footnotes

Acknowledgments

Sheela Ramanna’s research has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) discovery grant 194376. The author wishes to acknowledge the contribution of her graduate students: Cenker Sengoz, Aditya Bharadwaj, Hoora Rezai Moghaddam and Vrushang Patel.

References

Nadeau

Sekine

. A survey of named entity recognition and classification. Linguisticae Investigationes. 2007 January; 30(1): 3-26. Publisher: John Benjamins Publishing Company. Available from: http://www.ingentaconnect.com/content/jbp/li/2007/00000030/00000001/art00002.

Pang

Lee

Vaithyanathan

. Thumbs up? Sentiment Classification using Machine Learning Techniques. arXiv; 2002. Available from: https://arxiv.org/abs/cs/0205070.

Giachanou

Crestani

. Like it or not: A survey of twitter sentiment analysis methods. ACM Computing Surveys. 2017; 49(2): 1-41.

Medhat

Hassan

Korashy

. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal. 2014; 5(4): 1093-1113. Available from: https://www-sciencedirect-com-443.web.bisu.edu.cn/science/article/pii/S2090447914000550.

Zhang

Wang

Liu

. Deep Learning for Sentiment Analysis : A Survey. arXiv; 2018. Available from: https://arxiv.org/abs/1801.07883.

Hussein

DMEDM

. A survey on sentiment analysis challenges. Journal of King Saud University – Engineering Sciences. 2018; 30(4): 330-338. Available from: https://www-sciencedirect-com-443.web.bisu.edu.cn/science/article/pii/S1018363916300071.

Zadeh

. Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Systems. 1997; 177(19): 111-127.

Pawlak

. Rough sets. International Journal of Computer & Information Sciences. 1982; 11(5): 341-356.

Peters

. Near sets. General theory about nearness of objects. Applied Mathematical Sciences. 2007; 1(53): 2609-2029.

10.

Peters

. Near sets. Special theory about nearness of objects. Fundamenta Informaticae. 2007; 75(1-4): 407-433.

11.

Poincaré

. Science and Hypothesis. Brock University: The Mead Project; 1905. L. G. Ward’s translation.

12.

Zeeman

. The topology of the brain and visual perception. Univ of Georgia Institute Conf Proc. 1962. pp. 240-256. M.K. Fort, Jr. (Ed.), Topology of 3-Manifolds and Related Topics, Prentice-Hall, Inc.

13.

Sossinsky

. Tolerance space theory and some applications. Acta Applicandae Mathematicae: An International Survey Journal on Applying Mathematics and Mathematical Applications. 1986; 5(2): 137-167.

14.

Polkowski

Skowron

Zytkow

. Tolerance Based Rough Sets. In: Lin TY, Wildberger M. (eds.) Soft Computing: Rough Sets, Fuzzy Logic, Neural Networks, Uncertainty Management, Knowledge Discovery. San Diego: Simulation Councils Inc.; 1994. pp. 55-58.

15.

Nieminen

. Rough tolerance equality and tolerance black boxes. Fundamenta Informaticae. 1988; 11: 289-296.

16.

Novotný

Pawlak

. Black Box Analysis and rough top equality. Bull Polish Academy of Sciences, Technical Sciences. 1985; 33: 105-113.

17.

Marcus

. Tolerance rough sets, cech topologies, learning processes. Bull Polish Academy of Sciences, Technical Sciences. 1994; 42(3): 471-487.

18.

Schroeder

, M W. Tolerance and weak tolerance relations. Journal of Combinatorial Mathematics and Combinatorial Computing. 1992; 11: 123-160.

19.

Nakamura

. Fuzzy rough sets. Note on Multiple-Valued Logic in Japan. 1988; 9(8): 1-8.

20.

Dubois

Prade

. Rough fuzzy sets and fuzzy rough sets*. International Journal of General System. 1990; 17(2-3): 191-209.

21.

Del Cerro

Prade

. Rough sets, twofold fuzzy sets and modal logic Fuzziness in indiscernibility and partial information. The Mathematics of Fuzzy Systems. 1986; 103-120.

22.

De Cock

Cornelis

. Fuzzy rough set based web query expansion. In: Proceedings of Rough Sets and Soft Computing in Intelligent Agent and Web Technology; 2005. pp. 9-16.

23.

Cock

Cornelis

Kerre

. Fuzzy Rough Sets: Beyond the Obvious. In: Proceedings of the 2004 IEEE International Conference on Fuzzy Systems. Vol. 1; 2004. pp. 103-108.

24.

Poli

Llapa

Cecatto

Saito

Peters

Ramanna

, et al. Solar flare detection system based on tolerance near sets in a GPU-CUDA framework. Knowledge-Based Systems Journal, Elsevier. 2014; 70: 345-360.

25.

Kawasaki

Nguyen

. Hierarchical Document Clustering Based on Tolerance Rough Set Model. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery; 2000. pp. 458-463.

26.

Nguyen

. Nonhierarchical document clustering based on a tolerance rough set model. International Journal of Intelligent Systems. 2002; 17: 199-212.

27.

Ngo

. A tolerance rough set approach to clustering web search results. Warsaw University; 2003.

28.

. Web Document Classification Based on Extended Rough set. In: Sixth International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT’05); 2005. pp. 916-919.

29.

Ding

Wang

. On-line hot topic recommendation using tolerance rough set based topic clustering. J Comput. 2010; 5: 549-556.

30.

Shi

Duan

Zhao

. Rough set and ensemble learning based semi-supervised algorithm for text classification. Expert Syst Appl. 2011 May; 38(5): 6300-6306.

31.

Virginia

Nguyen

. Lexicon-based document representation. Fundamenta Informatica. 2013; 124(1-2): 27-46.

32.

Virginia

Nguyen

. A semantic text retrieval for indonesian using tolerance rough sets models. Transactions on Rough Sets. 2015; LNCS 8988(XIX): 138-224.

33.

Świeboda

Meina

Nguyen

. Weight Learning for Document Tolerance Rough Set Model. In: Lingras P, Wolski M, Cornelis C, Mitra S, Wasilewski P, eds. Rough Sets and Knowledge Technology. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013. pp. 385-396.

34.

Nguyen

. An Approach to Semantic Indexing Based on Tolerance Rough Set Model. In: Nguyen NT, van Do T, le Thi HA, eds. Advanced Computational Methods for Knowledge Engineering. Heidelberg: Springer International Publishing; 2013. pp. 343-354.

35.

Swieboda

Krasuski

Nguyen

Janusz

. Interactive method for semantic document indexing based on explicit semantic analysis. Fundam Informaticae. 2014; 132(3): 423-438. Available from: https://doi.org/10.3233/FI-2014-1052.

36.

Nguyen

. Applications of Tolerance Rough Set Model Semantic Text Analysis. In: Ropiak K, Polkowski L, Artiemjew P, eds. Proceedings of the 28th International Workshop on Concurrency, Specification and Programming, Olsztyn, Poland, September 24–26th, 2019. vol. 2571 of CEUR Workshop Proceedings. CEUR-WS.org; 2019. Available from: http://ceur-ws.org/Vol-2571/CSP2019_paper_18.pdf.

37.

Ramanna

Peters

Sengoz

. Application of tolerance rough sets in structured and unstructured text categorization: a survey. In: Thriving Rough Sets. Springer; 2017. pp. 119-138.

38.

Srinivasan

Ruiz

Kraft

Chen

. Vocabulary mining for information retrieval: Rough sets and fuzzy sets. Information Processing & Management. 2001; 37(1): 15-38.

39.

Cornelis

De Cock

Radzikowska

. Fuzzy Rough Sets: From Theory into Practice. In: Handbook of Granular Computing. John Wiley & Sons, Ltd; 2008. pp. 533-552.

40.

Behera

Kumaravelan

. Text document classification using fuzzy rough set based on Robust nearest neighbor. Soft Computing. 2021 08; 25.

41.

Mitchell

Cohen

Hruschka

Talukdar

Betteridge

Carlson

, et al. Never-ending learning. Communications of the ACM. 2018; 61(5): 103-115.

42.

Dong

Gabrilovich

Heitz

Horn

Lao

Murphy

, et al. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA; 2014. pp. 601-610.

43.

Curran

Murphy

Scholz

. Minimising semantic drift with mutual exclusion bootstrapping. In: Proc. of PACLING; 2007. pp. 172-180.

44.

Sengoz

. A Granular-based Approach for Semi-supervised Web Information Labeling. University of Winnipeg; 2014. Supervisor: S.Ramanna.

45.

Bharadwaj

. Fuzzy Rough Set Based Unstructured Text Categorization. University of Winnipeg; 2017. Supervisor: S.Ramanna.

46.

Moghaddam

. Exploring Scalability and Concept drift issues in Learning Categorical facts with Tolerance Rough Sets. University of Winnipeg; 2019. Supervisor, S.Ramanna.

47.

Carlson

Betteridge

Wang

Hruschka

Jr Mitchell

. Coupled semi-supervised learning for information extraction. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining; 2010. pp. 101-110.

48.

Verma

Hruschka

Jr. Coupled Bayesian Sets Algorithm for Semi-supervised Learning and Information Extraction. In: ECML PKDD Part II LNCS 7524; 2012. pp. 307-322.

49.

Moghaddam

Ramanna

. Harvesting patterns from textual web sources with tolerance rough sets. Cell Press, Elsevier. 2020; 1(4): 100053.

50.

Patel

. Short Text Classification with Tolerance Near Sets. University of Winnipeg; 2021. University of Winnipeg, Supervisor: S.Ramanna.

51.

Devlin

Chang

Lee

Toutanova

. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018.

52.

Reimers

Gurevych

. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2019. Available from: https://arxiv.org/abs/1908.10084.

53.

Skowron

Stepaniuk

. Tolerance approximation spaces. Fundam Inf. 1996 Aug; 27(2,3): 245-253.

54.

Sengoz

Ramanna

. A Semi-supervised Learning Algorithm for Web Information Extraction with Tolerance Rough Sets. In: AMT 2014, LNCS 8610; 2014. pp. 1-10.

55.

Sørensen

. A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analyses of the Vegetation on Danish Commons. Biologiske skrifter. I kommission hos E. Munksgaard; 1948.

56.

Seeratpal

Ramanna

. Named entity recognition on CORD-19 bio-medical dataset with tolerance rough sets. Transactions on Rough Sets, Springer. 2022; XXIII(13610): 1-10. To appear.

57.

Jensen

Shen

. Computational intelligence and feature selection: rough and fuzzy approaches. vol. 8. John Wiley & Sons; 2008.

58.

Radzikowska

Kerre

. A comparative study of fuzzy rough sets. Fuzzy Sets and Systems. 2002; 126: 137-156.

59.

Bharadwaj

Ramanna

. Categorizing relational facts from the web with fuzzy rough sets. Knowledge and Information Systems. 2019; 61(3): 1695-1713. doi: 10.1007/s10844-018-0505-8.

60.

Wolski

. Perception and classification. A Note on Near sets and Rough sets. Fundamenta Informatica. 2010; 101: 143-155.

61.

Patel

Ramanna

Kotecha

Walambe

. Short text classification with tolerance-based soft computing method. Algorithms. 2022; 15(8). Available from: https://www.mdpi.com/1999-4893/15/8/267.

62.

Wasilewski

Peters

Ramanna

. Perceptual tolerance intersection. Transactions on Rough Sets Journal. 2011; 13: 159-174.

63.

Chen

Lerman

Ferrara

. Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health and Surveillance. 2020;6(2):e19273.

64.

Lang

. Newsweeder: Learning to filter netnews. In: Machine Learning Proceedings 1995. Elsevier; 1995. pp. 331-339.

65.

Rane

Kumar

. Sentiment classification system of Twitter data for US airline service analysis. In: 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC). vol. 1. IEEE; 2018. pp. 769-773.

66.

Maas

Daly

Pham

Huang

Potts

. Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; 2011. pp. 142-150.

67.

Socher

Bauer

Manning

. Parsing with compositional vector grammars. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2013. pp. 455-465.

68.

Rosenthal

Farra

Nakov

. SemEval-2017 task 4: Sentiment analysis in Twitter. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017); 2017. pp. 502-518.

69.

Kotzias

Denil

De Freitas

Smyth

. From group to individual labels using deep features. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2015. pp. 597-606.

70.

Zhang

Zhao

LeCun

. Character-level convolutional networks for text classification. arXiv preprint arXiv:150901626. 2015.

71.

Sengoz

Ramanna

. Learning relational facts from the web: A tolerance rough set approach. Pattern Recognition Letters. 2015; 67(P2): 130-137.

72.

Breiman

. Random forests. Machine Learning. 2001 10; 45: 5-32.

73.

Chang

Lin

. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011 4; 2: 1-27.

74.

Bottou

Bousquet

. The Tradeoffs of Large Scale Learning. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. vol. 20; 2007. pp. 1-9.

75.

Tianqi

Carlos

. XGBoost. In: Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems. ACM; 2016. pp. 785-794. Available from: doi: 10.1145/2F2939672.2939785.

Tolerance-based granular methods: Foundations and applications in natural language processing

Abstract

Keywords

1. Introduction

1 https://wordnet.princeton.edu/.

2.1 Tolerance rough sets

(Tolerance Rough Set).

2.1.1 Document representation with TRS

(Document Approximation Space).

.

.

.

.

2.1.2 Named entity representation with TRS

(TRS Approximation Space for nouns [54]).

.

.

.

(Trusted relations).

(Co-occurrence).

.

.

(FRS Approximation Space for relations [54]).

.

(Fuzzifying the co-occurrence function [22, 59]).

.

(Upper approximation).

.

(Lower approximation).

.

.

(Tight Upper approximation [22]).

.

(micro score).

.

.

.

.

.

.

.

3.1 Semi-supervised learning with TRS and FRS

.

(Promotion scoring).

.

5. Materials

5.1 NER dataset for TPL 1.0. TPL 2.0 and FRL experiments

2 https://lemurproject.org/clueweb09/.

Table 3 Dataset information

Table 4 TPL 2.0 and TPL 1.0 [71] and FRL [59] and CBS [48] results with Precision@30 per category

Table 5 Precision@30 of TPL 1.0, CPL and FRL for promotion-based method

Table 8 SBERT vector-based weighted F1-score (rounded) results for six classifiers. Best results are in bold-face

7 https://scikit-learn.org/stable/.

Footnotes

Acknowledgments

References

¹
https://wordnet.princeton.edu/.

²
https://lemurproject.org/clueweb09/.

Table 3
Dataset information

Table 4
TPL 2.0 and TPL 1.0 [71] and FRL [59] and CBS [48] results with Precision@30 per category

Table 5
Precision@30 of TPL 1.0, CPL and FRL for promotion-based method

Table 8
SBERT vector-based weighted F1-score (rounded) results for six classifiers. Best results are in bold-face

⁷
https://scikit-learn.org/stable/.