Abstract
As a data utility and aided tool, ontology has been widely used in many areas of the computer. Owing to its great efficiency, ontologies have also been introduced into various engineering disciplines. In this paper, we present the fundamental ideas of how to deal with similarity measuring problem in ontology learning algorithms. The mathematical basis of ontology learning algorithms is also introduced from a statistical learning theory point of view. Finally, we present two ontology learning algorithms in multi-dividing setting and ontology sparse vector learning setting, respectively.
Introduction
The term “ontology”, derived from philosophy, was introduced into the field of artificial intelligence in the early 1990 s. As a model of conceptual semantic storage, analysis and management, it has drawn great attention from the field of computer science and information technology. In the 21st century, scholars from various disciplines are using ontology tools to cope with various engineering problems, making the ontology popular in multi-science research as well, such as biology (Koehler et al. [1]), pharmacy (Przydzial et al. [2] and Bhattacharjee et al. [3]), education system (Demartini et al. [4]), psychology (Simut [5]), medicine (Shah et al. [6], Zghal and Moreno [7] (Mignard and Nicolle [8], and Brisaboa et al. [9]), neuroscience (Bowden et al. [10]), and nanotechno-logy.
In recent years, ontology methods have been applied to various ontology engineering applications. Ashraf et al. [12] discussed OUSAF (Ontology USage Analysis Framework) which is often used in large RDF data. Ongsri and Roddick [13] considered combining ontologies into three popular conceptual model approaches. Sorokine et al. [14] proposed a new ontology-driving-based information system. Groza et al. [15] explained the semantic unification in human phenotype ontology of common and rare diseases. Cannataro et al. [16] raised an ontology-based algorithm for fixed genome sequencing and gene sequencing based on the semantic similarity of each gene corresponding disease.
In general, the ontology structure is represented by the graph model G = (V, E), where each vertex on the ontology represents a concept of ontology. Each edge e = v i v j in the graph indicates that there is a direct relationship between the concepts v i and v j . The similarity between the two concepts in ontology can be expressed by the similarity of vertices. The similarity of vertices v i and v j is denoted by Sim (v i , v j ), whose value is a real number between [0,1]. Sim (v i , v j ) = 1 indicates that the concepts corresponding to v i and v j are the same. Sim (v i , v j ) = 0 means that there is no relationship between v i and v j . The threshold M is given by experts in the field. When Sim (v i , v j ) > M, the similarity between v i and v j is greater than M, and then it returns a set of concepts with higher similarity to the user.
If we consider this problem from a mathematical perspective, the goal of ontology similarity calculation is to find the optimal similarity calculation function
Ontology similarity calculation is the key of ontology algorithm and subject application. Early ontology similarity calculation methods are mostly heuristic. To sum up, these heuristic calculation methods follow the pattern below:
where α i (1 ≤ i ≤ 4) are all real numbers, and the values are between 0 and 1. These equilibrium parameters are given by experts in the field or empirically determined, and satisfy α1 + α2 + α3 + α4 = 1. Sim name (v i , v j ) represents the name similarity between the concepts corresponding to v i and v j . Sim instance (v i , v j ) represents the instance similarity between the concepts corresponding to v i and v j , and Sim attribute (v i , v j ) represents the attribute similarity between the concepts corresponding to v i and v j . Sim structure (v i , v j ) represents the structural similarity between concepts corresponding to v i and v j . In other words, the similarity between v i and v j is weighted by the similarity of their names, instances, attributes and structures. In the formula for calculating the similarity of ontology vertices, there will be other parameters to be determined by specialists in the field. Traditionally, the parameters that are multiplication coefficients in α i (1 ⩽ i ⩽ 4) and partial ontology similarity calculation formulas are called the first type of parameters, and the parameters inside the partial ontology similarity calculation formula are called the second type of parameters (for example, equilibrium coefficient appeared in a part of the vertex similarity calculation formula molecules). The shortcomings of this heuristic ontology similarity calculation model above lie in: 1) almost all the parameters need to be decided by experienced domain experts in practical operation. When they are unable to find domain experts, the given parameters may be unreasonable, and the gap is relatively large, which will directly affect the efficiency of the algorithm; 2) This method to calculate the similarity in pairs is not clear and direct, and it needs a large amount of computation. Meanwhile, a detailed analysis for ontology structures needs to be done in advance.
To sum up, under the background of big data, the heuristic method of calculating the similarity formula has great limitations. In order to solve the above problems and break through the restriction of traditional given parameters ontology calculation model, the machine learning method is widely applied to the ontology similarity calculation. The core idea of ontology conceptual similarity calculation algorithm based on machine learning method can be described as follows: According to the learning of sample points, the optimal ontology function
In this section, we introduce two main ontology algorithms in multi-dividing ontology setting and ontology sparse vector computing setting, respectively.
Ontology algorithm in multi-dividing setting
In recent years, many scholars have proposed a variety of techniques for calculating ontologysimilarity and ontology mapping, and used them in multidisciplinary projects. One of the methods is to use the ontology graph structure to get all the vertices of the ontology graph divided into k parts for k levels. The level of value is determined by domain experts. In this framework, for ontology function f, the real number corresponding to the vertex in level a is greater than the real number corresponding to the vertex in any level b, where 1 ⩽ a < b ⩽ k. At last, the similarity between ontology vertices is determined by the absolute value of the difference between their corresponding real numbers. Known as multi-dividing based ontology learning algorithms, these algorithms have been proven to be applicable to many engineering fields. The following two examples are listed to illustrate how the algorithm is applied to the project.

Genetics “GO”ontology graph.

Phytology “PO”ontology graph.

“University” Ontology O1.

“University” Ontology O2.
It can be seen from the above four examples, when the ontology structure is a tree, the multi-dividing framework is suitable for ontology similarity calculation and ontology mapping. However, in the field of engineering applications, most of the ontology graphs are tree-structured for the needs of hierarchical representation, so the ontology function learning algorithm under the multi-dividing framework can be well implemented in practice.
In the early years, there had been some theoretical analysis results for the algorithm of multi-dividing ontology learning (see Gao et al. [17–27], Zhu et al. [28], Maldonado et al. [29], Ansari et al. [30], and Fouda et al. [31] for more details).
In ontologies related to engineering such as physics, chemistry, materials, medicine and pharmacy, due to the complex molecular structure information of genes and others, semantic information and related structures, genes, information to be contained in the ontologies are quite huge. As a result, the design of ontology learning algorithm has higher requirements in terms of algorithm complexity and algorithm optimization. In order to obtain the similarity calculation results quickly and efficiently, the sparse algorithm learning is introduced into ontology similarity calculation and ontology mapping. Based on a particular application domain and the special structure information of the ontology graph under this background, an ontology sparse vector learning algorithm suitable for a specific application background is designed, so as to obtain the optimal ontology function, and then obtain the similarity calculation method between ontology concepts. It is the trend of the design of ontology learning algorithm in the future big data framework.
In the representation of the ontology concept information, a vector is often used to characterize all the information corresponding to the concept. Due to the complexity of information, the dimension of this vector may be very large in some specific application fields. The essence of ontology function is a dimensionality reduction operator, which uses high-dimensional data to represent one-dimensional real numbers. In this sense, the essence of ontology learning algorithm is dimensionality reduction algorithm. Based on the ontology algorithm of ontology sparse vector, the basic idea is that for a specific application area, effectively shield irrelevant feature information, thereby enhancing the valuable information. For example, in genetics, a genetic disease associated with only a few genes, and millions of other genes in the human body has no connection with the disease, the goal of ontology sparse vectors is to effectively find these genes that have decisivesignificance.
In Mathematics, a sparse vector is defined as a vector whose major components are 0, and then it multiplied the corresponding vector of any one of the ontology concept. Then, what really works is only the key information that corresponds to a small portion of the original ontology vector.
Introduce of fundamental of statistical machine learning
Some examples of machine learning
This main algorithm is based on statistical learning theory to analyze machine learning based ontology algorithm. We first give a few examples to illustrate what machine learning is.
Traditionally, to measure the approximation effect, we set a straight line with the least squares:
and the best straight line is the one makes
And then the smallest square can be expressed as minimized:
where:
In general, the value of sample size n is much larger than N. It can be seen that the essence of the above learning is the learning coefficient vector, and its classical solution comes from linear algebra and numerical calculation.
Since the value of y
i
is affected by the noise, for any
The approximation of the function can be stated as
where φ
i
is a group of bases in the function space. They are not necessarily polynomials, and the goal of learning is to get the coefficient vector
Here, e
i
is a vector in
Denote f
S
and f
T
are the characteristic functions of S and T, respectively, and then:
Is the error of S. The values of f
S
and f
T
can only be 0 or 1.Therefore, the integral is equivalent to
Let C be a class of subsets of
To approximates the integral value:
By setting a weak regularization condition on the function f, I m (f) → ∫f holds at probability 1, that is, for all ɛ > 0, we have:
In this example, the measure makes the sample knowable (the measure on [0, 1]
n
inherits the characteristics of the standard Lebesgue measure on
Let X be a tight metric space,
For f : X → Y, its generalized error is defined as:
where (f (x) - y) 2 is called local error. An important problem in learning is how to reduce the value of generalized error. A classical method is to decompose L (f) into multiple parts.
For each x ∈ X, let ρ (y|x) be the conditional probability measure of x on Y, ρ
X
is the Marginal probability measure of ρ on X, which is stated as:
where π : X × Y → X is projection operator (bounded, linear operators in normed linear space). For any integral function
∫X×Yφ (x, y) dρ = ∫ X (∫ Y φ (x, y) dρ (y|x)) dρ X decompose ρ into ρ (y|x) and ρ X , divide Z into input domain X and output domain Y, and maps their direct products.
Let f
ρ
: X → Y, which is stated as:
The function f
ρ
is called the regression function of ρ. For each x ∈ X, f
ρ
(x) is the mean of the ordinate of {x} × Y. The regularity of ρ can help to get the regularity of f
ρ
, and we always assume that f
ρ
is bounded. Let x ∈ X, consider a function from Y to
If we average on X, then
We can measure the quality of ρ by considering the value of
Since
Consider the sampling, let
Be samples on Z
m
, that is, m instances follow the independent distribution of ρ, and Z
m
represents the m Cartesian product on Z. The empirical error of f with respect to z is defined as:
If ξ is the random variable on Z, then denote the empirical mean of ξ with respect to z as E
For any function f : X → Y, denote f
Y
: X × Y → Y as (x, y) ↦ f (x) - y. Then:
And the mathematical expectations of (f
ρ
)
Y
is 0, and the variance is
Let C (X) be a Banach space formed by a continuous function on X with the following generalizations:
Considering the subspace H of C (X), the learning algorithm looks for the function closest to f ρ from H, which is called Hypothesis Space. In the process of machine learning, H is the key element. If H is too small, the performance of the optimal function will be not good enough. If H is too large, the complexity of the algorithm will be greatly increased. We always hope f ρ ∈ H, but in practice, even f ρ ∈ C (X) may not be guaranteed.
Considering f
H
∈ H as the objective function, we define that f
H
is obtained by looking for a function in H that minimizes the error L (f)(assuming a loss function is squared loss):
For
Let
The following example illustrates the existence of f f
H
and f
L
Let f1, f2 ∈ C (X), for almost all
It can be seen from the results that:
① Let H ⊆ C (X), if |f (x) - y| ⩽ M holds for all f ∈ H almost everywhere, then L and L
② Let H ⊆ C (X) be tight, if |f (x) - y| ⩽ M holds for all f ∈ H almost everywhere, then f
H
and f
What needs to be specially explained here is that the functions f
H
and f
Let H be a hypothetical, the error of the function f ∈ H in H can be expressed in normal error:
For any f ∈ H, L
H
(f) ⩾0 and L
H
(f
H
) =0, then, L (f
H
) and L
H
(f) are absolutely different. By simple calculation, we get

Linear regression.
Consider L
H
(f
Then, the Approximation Error can be described as:
Therefore,
The estimation of A (H) has nothing to do with the sample z, and its ρ depends on the regression function f ρ . As for the sample error, once the sample is given, it has no relation with ρ any more, although the sample is determined by ρ.
When the hypothesis space H is fixed, the sample error decreases as the sample size m increases. With a fixed sample size m, as the H range expands, the approximation error decreases but the sample error increases. The bias problem means that when the sample size m is fixed, choosing the size of H to minimize the error L (f
If H is too small, it will lead to large bias, while if H is too large, it will produce large variance. Therefore, the size of H is the core of learning theory. Figures 6 and 7 describe the normal function, functions in the case of large bias and large variance respectively. It can be seen that in the case of large bias, the curve obtained does not reflect the actual situation; in the case of large variance, the resulting curve is in poor smooth, that is to say, the function is “shaking” violently, this learning function can’t be used for other data sets, resulting in poor generalization of the algorithm.

Normal curve.

Large bias.
In what follows, we will turn to ontology setting, and we set V as the ontology instance space because the ontology structure is presented as a graph.
In this section, we aim to present our first ontology algorithm in the confidence weighted multi-dividing setting.
As the concepts mentioned in the first section, the so-called multi-segmentation ontology learning is to use the ontology graph which mostly has a tree structure. The similarity between the concepts belonging to the same branch is greater than that belonging to different branches. From the point of view of data distribution, the data corresponding to the vertices belonging to the same branch are relatively close to each other, and they can be considered to be relatively concentrated in the same area. If these data points are projected as one-dimensional real lines, the data points corresponding to the vertices of the same branch should be concentrated in a certain range. In this feature, a plurality of branches in the tree structure are divided into a plurality of levels, and the real numbers corresponding to the respective branches (rank) are sequentially arranged on the real axis.
According to this idea, let
Let k be the number of divisions after division, V be the input space, or the instance space, whose e1ements are the vertices of the onto1ogy graph. For Y ⊆ {1, 2, ⋯, k} is the hierarchica1 markup space, D is the unknown distribution V × Y. Set
be a set of samples with a capacity of n. The goal of the ontology learning algorithm is to obtain an optimal ontology function
Since each tag y does not represent the value of f (v) under the multiple partitioning framework, it represents the rank of the vertex v and takes an integer between 1 and k. Thus the sample set S can be expressed as
where for any a ∈ {1, 2, ⋯, k}, we have
represents both the vertex and the vector corresponding to the vertex, and the reader can determine the content of the vertex according to the context.
In the multi-segmentation framework, the optimal ontology function we need satisfies the following condition: For 1 ⩽ a < b ⩽ k, the vertex in any class α in f has a larger value than the vertex in any class b, ie f (v) > f (v′). If v ∈ S a and v′ ∈ S b . For a ∈ {1, 2, ⋯, k}, let D a = DV|Y=a be the conditional distribution.
In multi-dividing setting, the optimal ontology function can be expressed by minimizing the following expression:
where I represents the true value function, the result is 1 when the argument is true, and the result is 0 when the argument is false. Since the distribution D is unknown in advance, the above equation can not be calculated directly.
Using ontology sample set S, the empirical ontology model can be determined by minimizing
Although the distribution of the sample obeys D, but when S is given, the sample set is not directly related to the distribution. Thus this version can be calculated directly.
Although the direct calculation is very difficult, it is desirable that the essence of the truth function I is a discrete counting function that is not derivable. Thus, in most cases, it is necessary to set a continuous derivative loss function
The corresponding empirical ontology framework is denoted as
There are some commonly used ontology lost function list as follows (here x is the variable of ontology loss functions):
Hinge loss function:
Least square loss function:
Exponential loss function:
Logarithmic loss function
Symmetric ɛ-insensitive logarithmic loss function:
Here, we use the hinge loss as our ontology loss function, and thus the ontology objective function we find to minimize to linear function can be formulated as follows:
where ∥w∥ is denoted as the Euclidean norm to regularize the ontology order, and for each pair of (a, b), Ca,b is the parameter to punish the sum of errors. We can reformulate the ontology objective function as a sum of two losses for a pair of vertices for each pair of (a, b),
where Ta,b = n
a
+ n
b
and each pair of (a, b)
Instead of maintaining all the obtained vertices to calculate the gradients
The framework of the online confidence weighted multi-dividing ontology algorithm is depicted in Algorithm 1.
Input: the punish parameter Ca,b for each pair of (a, b); the capacity of buffers M
a
for a ∈ {1, ⋯, k}, a parameter ηa,b for each pair of (a, b); set
For each pair of (a, b), do:
Initialize:
For t = 1, ⋯, Ta,b, do
Obtain a training ontology vertex (v t , y t );
If y
t
= a, then
End If
If y t = b, then
End If
End For
End For
In this section, we mainly present the algorithm for ontology sparse vector learning.
Let
In the online ontology learning setting, the algorithm gets an ontology sample z t = (v t , y t ) at a time, and at time t, the vector is updated in an online way with an ontology sample z t ∈ S drawn randomly
where positive number η is the rate of learning and L′ (βt-1, z t ) ∈ ∂ β t L (βt-1, z t ) is a subgradient of the ontology loss function with respect to β t . If L (β, ·) is differentiable at β, we have ∂ β L (β, ·) = {∇ β L (β, ·)}.
For each
where T (β,
The easiest process can be stated as follows: input β0, g0 (base gravity parameter), K and η; for each iterative (from 1 to K), randomly and uniformly draw z
t
from S, and do operations as (1); at last, return
Let
Furthermore, let V
K
be a random subsample of {1, ⋯, n} with K ontology samples, and
where the probability P* is with respect to the random subsampling of V
K
. Let
Let positive integer n
K
be the number of truncated bursts, and
if i satisfies
For threshold π0 ∈ [0, 1], the collection of stable features with gravity parameter
We always write
In this fashion, the ontology sparse vector learning iterative algorithm can be described as follows. Input β0, g0,
If we consider the adaptive base gravity g0 for a optimal rejection rate w
s
∈ [0, 1] is expressed as
Finally, our ontology sparse vector learning algorithm can be stated as follows. Input ontology sample data S, w0, γ(it called annealing rate in machine learning), π0, K, n
K
, M, η, deduce g0,s (w
s
) according to (5); for all m ∈ {1, ⋯, M} and τ = 1, ⋯, n
K
, get determine
At last, return
In this paper, we introduce the fundamental ideas of ontology learning algorithm in multi-dividing setting and sparse vector learning setting. Also, the fundamental of statistical learning theory used in ontology learning is introduced. Finally, we design two ontology learning algorithms: confidence weighted multi-dividing ontology algorithm and ontology sparse vector learning via feature selection.
Footnotes
Acknowledgments
We thank the reviewers for their constructive comments in improving the quality of this paper. This work was supported in part by the National Natural Science Foundation of China (51574232), the Open Research Fund by Jiangsu Key Laboratory of Recycling and Reuse Technology for Mechanical and Electronic Products (RRME-KF1612) and the Natural Science Foundation of Jiangsu University of Technology in China (KYY14013).
