Mathematical formula information retrieval system

Abstract

This paper presents the design and implementation of MFIRS, a system for retrieving information about mathematical formulas. The system is divided into the following modules: input normalization, mathematical formula unification, mathematical formula encoding, text information feature extraction, mathematical formula feature extraction, mathematical formula indexing, and retrieval and ranking. A method for extracting features of mathematical formulas and keywords based on FastText word embedding technology is proposed. This method not only captures the structural features of a formula, but also makes it easy to compute formula similarity from the resulting vectors. At the same time, the model introduces semantic features from the context surrounding mathematical formulas to improve the domain relevance of search results. The MathRetEval dataset was created from about $7.9\times 10^{5}$ arXiv documents containing about $1.5\times 10^{8}$ mathematical formulas, and the scalability of the system is verified on this dataset. Mathematical formulas can be written in TeX or MathML. A query written in TeX is converted to a MathML tree representation and then indexed. MFIRS is a math-aware information retrieval system for mathematical formulas that supports similarity search over partial formulas.

Keywords

Mathematical formula index retrieval mathematical content representation document sorting retrieval engine

1. Introduction

1.1 Background

Digital libraries make much of the material a researcher needs searchable. However, common search technology focuses mainly on plain text: documents are represented as bags of words, which does not support the processing of mathematical formulas.

The scientific literature is full of indices and complex mathematical formulas, even in the basic metadata, titles, and abstracts of papers. Research experience with Google Scholar has shown that failure to include mathematical formulas in references can lead to serious search problems.

1.2 Current state of research and significance

The standard for exchanging mathematics between software tools is MathML from the W3C. Few people want to write MathML directly; most prefer a compact TeX-style notation such as LaTeX or AMS-LaTeX. As a result, a mathematical query system should allow users to query in their preferred notation, such as TeX or (AMS-)LaTeX. To accommodate the varied retrieval preferences of users, the data should be converted into a uniform format. Presentation MathML or Content MathML is used only for the output of software systems.

In the search for scientific and technical literature, the unsolved problem of mathematical search is especially visible and attracts great interest, because a system that does not support searching for mathematical formulas is incomplete. We therefore evaluated the currently popular mathematical retrieval systems, including MathDex [1], MathFind [2], EgoMath [3], Egothor [4], LatexSearch [5], LeActiveMath [6], MathWebSearch [7], MSE [8], FSE [9], ICST [10], IFISB [11], KWARC [12], MCAT [13], MIRMU [14], RIT (Rochester Institute of Technology) [15], and TUW [16], among others. Table 1 compares the aforementioned mathematical retrieval systems.

Table 1
Comparison of mathematical retrieval systems

Mathematical retrieval system	Mathematical query language	Supported mathematical expression formats	Index method	Web-based	Framework prototype
MathDex	MathML (Presentation)	MathML, LaTeX, OpenMath, infix, etc.	Text retrieval index	No	None
DLMF Search	LaTeX	TeX/LaTeX	Text retrieval index	No	Lucene
LeActiveMath	OpenMath	OMDoc	Text retrieval index	Yes	Lucene
EgoMath	LaTeX	MathML, PDF	Text retrieval index	Yes	Egothor v2
MathWebSearch	MathML (Content)	Content MathML, OpenMath, restricted Presentation MathML	Substitution tree index	Yes	None

2. System design and implementation

The developed system splits the index content into a mathematical formula index and an ordinary text index when indexing XHTML, HTML, and other documents. The indexing methods for the two types of content are different; ordinary text indexes are indexed in the conventional way.

Fig. 1 shows the overall architecture of the system.

Figure 1.

The overall architecture of the system.

The index module of the realized system is mainly used for normalizing the input and preprocessing the mathematical formula.

2.1 Input normalization

MathML documents are normalized using the UMCL toolset to prevent mathematical formulas with the same semantics from being represented by different MathML markup. To align the indexed MathML form with the query form and to improve the fairness of document similarity evaluation, the same MathML normalization is applied in both the indexing and the query phases.

The canonical MathML form generated by the UMCL transformation not only improves the fairness of the similarity evaluation, but also helps in adapting queries to the indexed form of MathML.

2.2 Mathematical formula unification

The system uses three types of unification algorithms. To obtain multiple common representations of each formula, the formulas are first tokenized.

The system returns matches similar to user queries, preserving formula structure and alpha-equivalence.

The three types of unification algorithms are as follows.

(1) Sort Unify: sort by the number of operations that can be substituted.
(2) Unify Variable: replace all variables with a unified symbol (ID) and retain the mandatory variables.
(3) Unify Constant: replace all numeric constants with a unified symbol (constant).
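
As an illustration of the variable and constant unification steps, the following minimal sketch operates on an already tokenized formula. The token patterns (single-letter variables, decimal constants), the function name `unify`, and the placeholder strings `id` and `const` are assumptions made for this example and are not taken from the MFIRS implementation.

```python
import re

def unify(tokens, keep=frozenset()):
    """Replace variables and numeric constants with unified symbols.

    tokens: list of formula tokens (assumed: single-letter variables).
    keep:   variables that must be retained unchanged (the mandatory ones).
    """
    unified = []
    for tok in tokens:
        if re.fullmatch(r"\d+(\.\d+)?", tok):
            unified.append("const")          # unify constant
        elif re.fullmatch(r"[A-Za-z]", tok) and tok not in keep:
            unified.append("id")             # unify variable
        else:
            unified.append(tok)              # operators and retained variables
    return unified

tokens = ["x", "^", "2", "+", "3", "*", "y"]
print(unify(tokens))
print(unify(tokens, keep={"x"}))
```

With this kind of unification, the queries x^2 + 3y and a^2 + 5b map to the same token sequence, which is what allows structurally similar formulas to match.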

2.3 Coding mathematical formulas

Common formats for representing formulas include MathML, LaTeX, MTEF, OpenMath, binary trees, N-ary trees, and infix notation.

To preserve the hierarchical information of mathematical formulas and the connections between mathematical symbols, MFIRS encodes formulas as N-ary trees built from the representative MathML format. After the two-dimensional structure of a formula has been mapped to an N-ary tree, the tree must be converted into a linear sequence. The “word structure” of this new language must then be defined: a single symbol node is the basic substructure into which the formula is parsed, and a pair-based structure is used to define each word.
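
The encoding can be sketched as follows. The `Node` class, the function `to_pairs`, and the choice of (parent, child) symbol pairs as the "words" are assumptions made for illustration; the text only states that a pair-based structure is used.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One symbol of the formula with an arbitrary number of children."""
    symbol: str
    children: list = field(default_factory=list)

def to_pairs(root):
    """Depth-first traversal that linearizes the tree into (parent, child) pairs."""
    pairs = []

    def visit(node):
        for child in node.children:
            pairs.append((node.symbol, child.symbol))
            visit(child)

    visit(root)
    return pairs

# N-ary tree for the formula a + b * c
tree = Node("+", [Node("a"), Node("*", [Node("b"), Node("c")])])
sequence = to_pairs(tree)
print(sequence)
```

Each pair keeps one edge of the tree, so the linear sequence still records which symbol is nested under which operator.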

Since several formats currently exist for describing mathematical formulas, the formulas must be converted into a uniform format for storage and processing in order to meet the requirements of retrieval. In future versions of MFIRS, the mathematical formulas will be converted into a unified format and stored in a big data system for retrieval.

2.4 Extract text information features

Text information is extracted at three levels: keyword level, paragraph level, and document level. It mainly covers titles, abstracts, and keywords, as well as the description of the mathematical formula.

Keyword extraction selects the set of words most relevant to the text, which summarizes its content and reduces the impact of irrelevant context on the information acquisition process. In this work, the Rapid Automatic Keyword Extraction (RAKE) algorithm is used to process the context of mathematical formulas. Each word in a candidate phrase is assigned a score that takes into account both its frequency and its co-occurrence with other words. The score of a word $w$ is given by Eq. (1):

$\displaystyle\text{Score}\left(w\right)=\frac{\textit{Deg}\left(w\right)}{\textit{Freq}\left(w\right)}$ (1)

where $\textit{Deg}\left(w\right)$ is the co-occurrence degree of the word $w$, that is, the number of times any word occurs together with $w$ in the same candidate phrase, and $\textit{Freq}\left(w\right)$ is the number of occurrences of $w$ in the text being processed.

The final process of extracting the text keywords around the formula is as follows:

(1) Segment the text of the 10 sentences surrounding a given formula, using punctuation marks and stop words as delimiters.

(2) Build a co-occurrence matrix of the text.

(3) Extract three features for each word: word frequency, word degree, and the ratio of word degree to word frequency.

(4) Score each word with Eq. (1): Score $=$ $\textit{Deg}\left(w\right)$ / $\textit{Freq}\left(w\right)$.

(5) Sort the words by score in descending order.

The degree of a word is the number of times the word co-occurs with words in the candidate phrases, and the score of a word is its degree divided by its frequency. The algorithm then iterates through all candidate phrases, sums the scores of the words that make up each phrase, and sorts the candidate phrases by score. The top 38.2% of the sorted results, by total score, are used as the final keywords extracted by RAKE.
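
The scoring procedure can be sketched in a few lines. This is a simplified illustration, not the MFIRS code: the stop word list is a small placeholder, and the degree of a word is taken to include the word itself, as in the standard RAKE formulation, which is an assumption about the variant used here.

```python
import re
from collections import defaultdict

# Illustrative stop word list (assumption); a real list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "in", "for", "to", "that"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation marks and stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(phrases):
    """Score words by Deg(w)/Freq(w) and phrases by the sum of their word scores."""
    freq = defaultdict(int)
    deg = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            deg[word] += len(phrase)  # co-occurrences within the phrase, self included
    word_score = {w: deg[w] / freq[w] for w in freq}
    phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(phrase_score.items(), key=lambda item: item[1], reverse=True)

text = "The gradient of a convex function is monotone. Convex function theory."
ranked = rake_scores(candidate_phrases(text))
print(ranked[0])
```

Longer phrases built from frequently co-occurring words rise to the top of the ranking; keeping only the leading share of the sorted list then yields the final keywords.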

This article uses the FastText model to train word embeddings on English text words around formulas. For text features containing mathematical content, the training process is improved, enabling it to capture the characteristics of professional vocabulary.

(1) Choice of stopwords

In a general text corpus, some common function words, such as “the”, “a”, “an”, “that”, and “those”, carry little content. These words are referred to as stop words and are removed from the corpus. Some stop words may affect mathematical semantics, such as “by”, “allow”, “near”, and “everywhere”; for these words, the probability of deletion is calculated according to Eq. (2).

$\displaystyle\text{P}\left({w_{i}}\right)=\max\left({1-\sqrt{\frac{t}{\text{f}\left({w_{i}}\right)}},0}\right)$ (2)

Where $\text{f}\left({w_{i}}\right)$ is the ratio of the number of words $w_{i}$ in the dataset to the total number of words, and the constant t is the set threshold (set to $10^{-4}$ in the experiment). It can be seen that the word $w_{i}$ is discarded only when $\text{f}\left({w_{i}}\right)>t$ , and the higher the word frequency, the greater the probability of being discarded.
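
A direct transcription of Eq. (2) makes the behavior around the threshold easy to check. The function name `discard_probability` is chosen for this example only.

```python
import math

def discard_probability(rel_freq, t=1e-4):
    """Eq. (2): probability of discarding a word with relative frequency rel_freq."""
    return max(1.0 - math.sqrt(t / rel_freq), 0.0)

# A word below the threshold is never discarded, a frequent word usually is.
for f in (5e-5, 1e-4, 1e-2):
    print(f, discard_probability(f))
```

For example, a word that makes up 1% of the corpus is discarded with probability of about 0.9, while a word at or below the threshold $t$ is always kept.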

(2) Apply negative sampling technology

In the traditional CBOW (continuous bag of words) model, negative sampling is applied when predicting the word $w$ from its context: for a given context Context($w$), the word $w$ is a positive sample and all other words are negative samples. Generally, 10 negative sample words are selected for each context. The selected words follow a smoothed unigram distribution, meaning that the probability of selecting a word is proportional to its frequency in the corpus raised to the power 3/4. The calculation formula is shown in Eq. (3):

$\displaystyle\text{P}\left({w_{i}}\right)=\frac{f\left({w_{i}}\right)^{3/4}}{\mathop{\sum}\nolimits_{j=0}^{n}\left({f\left({w_{j}}\right)^{3/4}}\right)}$ (3)
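
The sampling distribution of Eq. (3) can be sketched as follows. Raw counts are used instead of relative frequencies, which gives the same distribution after normalization; the function names and the fixed random seed are assumptions of this example.

```python
import random

def negative_sampling_distribution(counts, power=0.75):
    """Eq. (3): selection probability proportional to frequency ** (3/4)."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sample_negatives(dist, target, k=10, rng=None):
    """Draw k negative words for one context, never returning the target word."""
    rng = rng or random.Random(0)
    words = [w for w in dist if w != target]
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)

dist = negative_sampling_distribution({"the": 16, "matrix": 1})
print(dist)
print(sample_negatives(dist, target="matrix", k=3))
```

The exponent flattens the distribution: a word that is 16 times more frequent is only 8 times more likely to be drawn as a negative sample.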

Given the constructed positive samples and the selected negative samples, the goal of the model is to minimize the following function, Eq. (4):

$\displaystyle\text{F}=\mathop{\sum}\nolimits_{t=1}^{V}\left[{\log\left({1+e^{-\text{s}\left({w_{t},w_{C}}\right)}}\right)+\mathop{\sum}\nolimits_{n\in N_{t,C}}\log\left({1+e^{\text{s}\left({n,w_{C}}\right)}}\right)}\right]$ (4)

$\displaystyle\text{s}\left(w_{t},w_{C}\right)=\frac{1}{\left|w_{C}\right|}\mathop{\sum}\nolimits_{\gamma^{\prime}\in w_{C}}u_{\gamma^{\prime}}^{\top}v_{w_{t}}$

where $V$ is the size of the vocabulary, $\text{s}\left({w_{t},w_{C}}\right)$ is the score measuring how likely the context $w_{C}$ and the target word $w_{t}$ are to form a positive sample, and $N_{t,C}$ represents the set of negative samples corresponding to the current context and the target word. $v_{w_{t}}$ represents the vector of the word $w_{t}$, while the context vector is obtained by averaging the word vectors $u_{\gamma^{\prime}}$ of the words $\gamma^{\prime}$ in the window.

By using the negative sampling technique, the efficiency of the model weight update is improved, and the time complexity of calculating the negative sample is effectively reduced.
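
To make the structure of Eq. (4) concrete, the following sketch evaluates the bracketed term for a single target word, its context, and a handful of negative samples. It uses plain Python lists as vectors; the function names are hypothetical, and the score is computed as the mean dot product between the context word vectors and the candidate word vector, which is our reading of the definition of $\text{s}$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score(context_vectors, word_vector):
    """s(w, w_C): mean dot product between the context vectors and a word vector."""
    return sum(dot(u, word_vector) for u in context_vectors) / len(context_vectors)

def objective_term(context_vectors, target_vector, negative_vectors):
    """One bracketed term of Eq. (4): positive part plus the sum over negatives."""
    value = math.log(1.0 + math.exp(-score(context_vectors, target_vector)))
    for negative in negative_vectors:
        value += math.log(1.0 + math.exp(score(context_vectors, negative)))
    return value

context = [[1.0, 0.0]]
print(objective_term(context, [1.0, 0.0], [[-1.0, 0.0]]))
```

Only the target word and the sampled negatives contribute to each term, rather than the whole vocabulary, which is where the reduction in computation comes from.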

2.5 Extract mathematical formulas features

Mathematical formulas express logical semantics through the combination of mathematical symbols and varied spatial relationships. This system uses an N-ary arithmetic tree to represent the structure and hierarchy of a formula and traverses the tree depth-first to convert the formula into a linear sequence. A parsing method based on the operational structure is designed to generate the substructure sequence of a mathematical formula. While parsing formulas, we normalize and reconstruct the arithmetic trees to reduce the occurrence of synonymous forms. We then use a word embedding technique, the FastText model, to learn feature representation vectors of the formula substructures.

Considering the influence of the hierarchy and frequency of the substructure on the feature representation of the mathematical formula, we use the Smooth Inverse Frequency (SIF) method to assign weights to the substructure of the mathematical formula and calculate the feature vector based on these weights. The mathematical formula feature extraction method proposed in this section not only retains the formula hierarchy and structural features but also avoids the problem of high complexity in tree structure matching.

Input: formula sequences $F$; set of formula node pair tuples $V$; embedding vectors $\left\{{v_{t}:t\in V}\right\}$ of the formula node pair tuples; hyperparameters $\alpha,\beta,\gamma$.

Output: formula embedding vectors $\left\{{v_{f}:f\in F}\right\}$

1: for tuple t in $V$ do

2: $s_{t}\leftarrow$ frequency of occurrence of node pair tuple t

3: $p_{t}\leftarrow\frac{s_{t}}{\left|V\right|}$

4: end for

5: for formula sequence f in F do

6: $D_{f}\leftarrow$ depth of the formula tree

7: for tuple t in f do

8: $d_{t}\leftarrow$ depth at which tuple t is located

9: $L_{t}\leftarrow$ total number of tuples in the layer where tuple t is located

10: $w_{t}\leftarrow\frac{D_{f}-d_{t}}{D_{f}}\times\alpha+\frac{L_{t}}{\left|f\right|}\times\beta$, where $\alpha+\beta=1,\ 0\leqslant\alpha\leqslant 1,\ 0\leqslant\beta\leqslant 1$

11: end for

12: $v_{f}\leftarrow\frac{1}{\left|f\right|}\mathop{\sum}\limits_{t\in f}\frac{\gamma}{\gamma+p_{t}}v_{t}w_{t}$

13: end for

14: Construct matrix X with columns $\left\{{v_{f}:f\in F}\right\}$

15: Set $u$ as the first singular vector of X

16: for formula sequence f in F do

17: $v_{f}\leftarrow v_{f}-uu^{T}v_{f}$

18: end for
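
The algorithm above can be sketched in pure Python as follows. Several details are assumptions of this sketch rather than statements from the text: each formula is given as a list of (tuple, depth) entries with depths starting at 1, the depth of the formula tree is taken as the largest tuple depth, and the first singular vector is approximated by power iteration instead of a full singular value decomposition.

```python
import math
from collections import Counter

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_singular_vector(columns, iters=100):
    """Power iteration on X X^T, where the columns of X are the given vectors."""
    dim = len(columns[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        nxt = [0.0] * dim
        for v in columns:
            c = dot(v, u)
            nxt = [a + c * b for a, b in zip(nxt, v)]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        u = [a / norm for a in nxt]
    return u

def formula_embeddings(formulas, vectors, alpha=0.5, beta=0.5, gamma=1e-3):
    """formulas: list of formulas, each a list of (tuple, depth) entries, depth >= 1."""
    counts = Counter(t for f in formulas for t, _ in f)
    p = {t: counts[t] / len(vectors) for t in vectors}        # steps 1-4
    dim = len(next(iter(vectors.values())))
    embedded = []
    for f in formulas:                                         # steps 5-13
        tree_depth = max(d for _, d in f)                      # D_f (assumed)
        layer_size = Counter(d for _, d in f)                  # L_t for each layer
        v_f = [0.0] * dim
        for t, d in f:
            w = alpha * (tree_depth - d) / tree_depth + beta * layer_size[d] / len(f)
            scale = gamma / (gamma + p[t]) * w / len(f)
            v_f = [a + scale * b for a, b in zip(v_f, vectors[t])]
        embedded.append(v_f)
    u = first_singular_vector(embedded)                        # steps 14-15
    return [[a - dot(u, v) * b for a, b in zip(v, u)] for v in embedded]  # steps 16-18

vectors = {("+", "a"): [1.0, 0.0], ("+", "*"): [0.0, 1.0], ("*", "b"): [1.0, 1.0]}
formulas = [
    [(("+", "a"), 1), (("+", "*"), 1), (("*", "b"), 2)],
    [(("+", "a"), 1)],
]
print(formula_embeddings(formulas, vectors, gamma=1.0))
```

Tuples near the root and tuples in densely populated layers receive larger weights, frequent tuples are damped by the factor $\gamma/(\gamma+p_{t})$, and the final projection removes the direction shared by all formula vectors.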

2.6 Index math formula

This system indexes documents, paragraphs, and mathematical formulas, and the index data is organized as shown in Tables 2 and 3. The index model uses an abstract tree inverted index as its data structure, that is, a tree structure combined with an inverted table. Formulas are indexed in the form of an abstract tree and described by tree level: the closer a node is to the root, the more abstract it is, and the farther from the root, the more concrete. An inverted index over an abstract tree requires the mathematical formulas to be described abstractly. This abstract tree index model supports both abstract and structural queries on mathematical formulas.

Table 2
Formula index table

Index number	Index content	Equivalent formula set
1	A	A1, A2, A3 $\ldots$
2	B	B1, B2, B3 $\ldots$
$\ldots$	$\ldots$	$\ldots$
36	M	M1, M2, M3 $\ldots$
$\ldots$	$\ldots$	$\ldots$

Table 3
Formula-document correspondence table

Formula number	Formula content	Position of the formula in the document	File name
1	A1	OD1(22, 28) $\ldots$	D1, D2, D3 $\ldots$
2	A2	OD7(96, 103) $\ldots$	D2, D4, D8 $\ldots$
$\ldots$	$\ldots$	$\ldots$	$\ldots$
53	N2	ODn(157, 169) $\ldots$	D8, D11, D12 $\ldots$
$\ldots$	$\ldots$	$\ldots$	$\ldots$

Suppose we extract the formulas M1 and M2 from a document and their normalized standard form is M. Assuming that the index entry for M already exists, we insert them directly into the equivalent formula set with index content M in the index table, as shown in Table 2. When a user then searches for formula M3, the query is normalized to its standard form M, abstracted, and matched in the abstract tree. Once the match M is found, the equivalent formula set corresponding to M is looked up in the index table. The system then finds the documents containing these formulas via the formula-document correspondence table (Table 3) and returns the result to the user.
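
The two-table lookup can be illustrated with two dictionaries. This sketch leaves out the abstract tree matching and the formula positions, and the `normalize` function is a hypothetical placeholder that simply strips the trailing variant number (M3 becomes M); the real normalization is performed by the modules described in Sections 2.1 and 2.2.

```python
from collections import defaultdict

formula_index = defaultdict(set)      # Table 2: standard form -> equivalent formulas
formula_documents = defaultdict(set)  # Table 3: formula -> documents containing it

def normalize(formula):
    """Placeholder normalizer (hypothetical): M1, M2, M3 all map to M."""
    return formula.rstrip("0123456789")

def add_formula(formula, document):
    """Insert a formula extracted from a document into both tables."""
    formula_index[normalize(formula)].add(formula)
    formula_documents[formula].add(document)

def search(query):
    """Return the documents containing any formula equivalent to the query."""
    documents = set()
    for formula in formula_index.get(normalize(query), ()):
        documents |= formula_documents[formula]
    return sorted(documents)

add_formula("M1", "D1")
add_formula("M2", "D4")
print(search("M3"))
```

The query M3 itself never has to appear in any document: it is enough that its standard form M has an entry in the formula index table.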

2.7 Retrieval and Ranking algorithm

MFIRS mainly utilizes keyword extraction and keyword feature learning algorithms to obtain key information in the text surrounding the formula, reduce the interference of irrelevant content in the text on the similarity calculation, and further improve the accuracy of the model retrieval. The specific retrieval process of MFIRS is as follows.

(1) Select 16 pieces of data as the query statements for 16 types of scientific and technological documents.
(2) Extract keyword phrases using the RAKE algorithm.
(3) Obtain the keyword vector and the formula vector with the keyword feature extraction algorithm and the formula feature extraction algorithm, respectively.
(4) Compute the keyword vector and formula vector of the query, then compute their cosine similarity with the keyword vectors and formula vectors of all data in the test set, respectively, as shown in Eqs (5) and (6). Sort the query results from largest to smallest similarity to obtain the top 10,000 results. Finally, use Eq. (7) to rank the documents containing the query results.

$\displaystyle\text{Score}_{f}\left({Q,f}\right)=\frac{\mathop{\sum}\nolimits_{i=1}^{n}V_{Q_{i}}\times V_{f_{i}}}{\sqrt{\mathop{\sum}\nolimits_{i=1}^{n}\left({V_{Q_{i}}}\right)^{2}}\times\sqrt{\mathop{\sum}\nolimits_{i=1}^{n}\left({V_{f_{i}}}\right)^{2}}}$ (5)

where $V_{Q}$ is the feature representation vector of the query input and $V_{f}$ is the feature representation vector of the mathematical formula $f$. Eq. (6) is used to compute the similarity score, $\text{Score}_{t}\left({q,K}\right)$, between the keywords of the query and the keywords in the context of formula $f$. In this way, the top 10,000 formulas closest to the query can be obtained.

$\displaystyle\text{Score}_{t}\left({q,\text{K}}\right)=\frac{\mathop{\sum}\nolimits_{i=1}^{m}V_{q_{i}}\times V_{K_{i}}}{\sqrt{\mathop{\sum}\nolimits_{i=1}^{m}\left({V_{q_{i}}}\right)^{2}}\times\sqrt{\mathop{\sum}\nolimits_{i=1}^{m}\left({V_{K_{i}}}\right)^{2}}}$ (6)

where $V_{q}$ represents the keyword feature vector of the query and $V_{K}$ represents a keyword feature vector in the dataset.

Eq. (7) is used to compute the similarity between the query input and a document A in the dataset, combining the formula scores and the keyword scores. This similarity is computed for all documents in the dataset, and the resulting documents are sorted by it.

$\displaystyle\text{Score}_{ft}\left({Q,q,K,f}\right)=\theta\cdot\mathop{\sum}\nolimits_{f\in A}^{v}\text{Score}_{f}\left({Q,f}\right)+\omega\cdot\mathop{\sum}\nolimits_{t\in A}^{w}\text{Score}_{t}\left({q,\text{K}}\right),$ (7)

$\displaystyle\theta+\omega=1,$ $\displaystyle 0\leqslant\theta\leqslant 1,$ $\displaystyle 0\leqslant\omega\leqslant 1.$

The default value of $\theta$ and $\omega$ is 0.5; users can choose different values depending on the actual situation.
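
The scoring in Eqs (5)-(7) can be sketched as follows. The data layout (each document as a pair of formula vectors and keyword vectors) and the function names are assumptions of this example, and the intermediate cut to the top 10,000 formulas is omitted for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity as in Eqs (5) and (6); 0.0 if either vector is zero."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def document_score(query_formula, query_keyword, doc_formulas, doc_keywords,
                   theta=0.5, omega=0.5):
    """Eq. (7): weighted sum of formula and keyword similarities of one document."""
    formula_part = sum(cosine(query_formula, f) for f in doc_formulas)
    keyword_part = sum(cosine(query_keyword, k) for k in doc_keywords)
    return theta * formula_part + omega * keyword_part

def rank(query_formula, query_keyword, documents, theta=0.5, omega=0.5):
    """Sort document names by their combined score, best first."""
    scores = {
        name: document_score(query_formula, query_keyword, fs, ks, theta, omega)
        for name, (fs, ks) in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

documents = {
    "A": ([[1.0, 0.0]], [[0.0, 1.0]]),
    "B": ([[0.0, 1.0]], [[1.0, 0.0]]),
}
print(rank([1.0, 0.0], [0.0, 1.0], documents))
```

Setting the formula weight to 1 and the keyword weight to 0 reduces the ranking to pure formula similarity, which is a convenient baseline when tuning the two weights.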

3. Experimental evaluations of the system

In this paper, a large-scale evaluation is achieved with the aid of a mathematical text repository.

3.1 Dataset

To evaluate our system, we built a library of mathematical texts called MathRetEval. We downloaded documents from arXiv.org and converted the TeX sources into XML documents. Mathematical formulas are represented in the W3C standard MathML. The documents come from various scientific fields, for example mathematics, physics, computer science, statistics, quantitative finance, and quantitative biology.

In this experiment, we used this corpus of real mathematical papers to evaluate the performance of the system. First, the arXiv documents are converted into content-based XHTML format and presentation-based MathML format. The resulting corpus is 260 GB uncompressed and 16 GB compressed.

3.2 Test results

The following tests examine the developed system’s ability to index and retrieve from a relatively large repository of real scientific literature. The goal is to observe how system parameters affect the index file size, retrieval time, and ranking of retrieved documents, and to check the scalability of the system. The retrieval performance of the system for both text and formulas is evaluated under different configurations.

The entire document set contains $1.5\times 10^{8}$ formulas. After all pre-processing, the system indexes $0.89\times 10^{8}$ unique formulas, with an index run time of 1378 minutes (approximately 23 hours), resulting in an index size of approximately 88 GB.

Computing resources and experimental parameters: 512 GB of RAM, 48-core Intel Xeon CPU, Ubuntu v22.04 operating system.

Figure 2.

The results of the scalability test of input documents and indexed formulas.

As shown in Fig. 2, the system scales approximately linearly with the number of documents. This provides a viable response time even for billions of indexed sub-formulas, and even small formulas can score matches in most documents. We then run queries of different kinds, such as hybrid, non-hybrid, and high/low complexity single/multi-formula queries, against the index and measure the average query time of the system on the MathRetEval dataset, which is 512 milliseconds.

Figure 3.

Comparison of index consumption times of various systems.

Figure 4.

Comparison of index sizes of various systems.

As the number of input documents increases, Fig. 3 shows that the index consumption time of each system has an increasing trend. When the number of input documents is fixed, the index consumption time of the LatexSearch system is the longest, and the index consumption time of the MFIRS system is the shortest. When the number of input documents reaches 1.46 million, the indexing time curves of the LeActiveMath, MathWebSearch, and TUW systems appear to overlap. The indexing time of MFIRS shows an approximately linear increase.

As the number of input documents increases, Fig. 4 shows that the index file size of each system tends to increase. Among them, LeActiveMath has the largest index file, followed by MathWebSearch and LatexSearch; TUW is in third place, and MFIRS has the smallest index file. When the number of input documents reaches approximately 12.9 billion, the index file size of LatexSearch, LeActiveMath, and TUW is almost the same.

Figure 5.

Comparison of the average search times of various systems.

As can be seen from Fig. 5, the average search time of each system increases with the number of documents. The average search times of LeActiveMath and MathWebSearch nearly coincide and are the longest, followed by LatexSearch and TUW. With 2.5 million input documents, the average search time of MFIRS was around 91 milliseconds, and the average search times of LeActiveMath, MathWebSearch, LatexSearch, and TUW were all longer than that of MFIRS.

Figure 6.

Comparison of the number of index formulas between each system.

As can be seen from Fig. 6, the number of index formulas in each system tends to increase with the number of input documents. When the number of input documents is fixed, the LeActiveMath system has the most index formulas, while the MFIRS system has the fewest. The index formula curves of LatexSearch, TUW, and MathWebSearch overlap in many places, indicating that the number of index formulas of these three systems is essentially the same.

The proposed formula retrieval approach has been tested using multiple configurations on various standard datasets. In this section, we describe the three major sets of experiments used for evaluating and improving our model.

On the designed dataset, the proposed system is compared to the currently popular mathematical formula retrieval systems. The experimental metrics used include precision@k, recall@k, F1@k, and nDCG@k.

We use the following definitions. TP (True Positive): the true category of the sample is positive and the predicted result is also positive. FP (False Positive): the true category of the sample is negative, but the predicted result is positive. TN (True Negative): the true category of the sample is negative and the predicted result is also negative. FN (False Negative): the true category of the sample is positive, but the predicted result is negative.

Precision@k is the proportion of relevant results among the $k$ returned results; it quantifies how many items in the top $k$ results are relevant. Precision@k is given by Eq. (8).

$\displaystyle\textit{Precision@k}=\frac{\textit{TP@k}}{\textit{TP@k}+\textit{FP@k}}$ (8)

where $\textit{TP@k}+\textit{FP@k}=k$, so the value range of Precision@k is [0,1], with a larger value being better.

The Precision@k values of each system are shown in Figs 7–10.

Figure 7.

Comparison of various systems on the P@5 metrics.

Figure 8.

A comparison of various systems based on the Precision@10 metric.

Figure 9.

Comparison of Systems on Precision@15 metrics.

Figure 10.

Comparison of systems on Precision@20 metrics.

As can be seen from Figs 7–10, the systems with higher and comparable precision include MathWebSearch, MCAT, TUW, RIT, and MFIRS. Systems with relatively lower and comparable precision include KWARC, MIRMU, LeActiveMath, LatexSearch, Egothor, EgoMath, MathFind, and MathDex. The MFIRS system has the highest precision and the MathDex system has the lowest. The precision of all systems is greater than 0.8. A limitation of Precision@k is that it does not take into account the rank positions of the relevant results.

Recall@k is the proportion of relevant results retrieved in the top $k$ out of all relevant results, as shown in Eq. (9).

$\displaystyle\textit{Recall@k}=\frac{\textit{TP@k}}{\textit{TP@k}+\textit{FN@k}}$ (9)

where $\textit{TP}+\textit{FN}$ is equal to the number of positive samples, the range of Recall@k is [0,1], with a larger value being better. The Recall@k values of each system are shown in Figs 11–14.

Figure 11.

Comparison of systems on Recall@5 metrics.

Figure 12.

Comparison of systems on Recall@10 metrics.

Figure 13.

Comparison of systems on Recall@15 metrics.

Figure 14.

Comparison of systems on Recall@20 metrics.

As Figs 11–14 show, the recall of each system is above 0.81; the MFIRS system has the highest recall, and the MathDex system has the lowest. Systems with higher and similar recall include MathWebSearch, MCAT, TUW, and RIT. Systems with relatively low recall include KWARC, MIRMU, LeActiveMath, LatexSearch, Egothor, EgoMath, and MathFind.

F1@k is the harmonic mean of Precision@k and Recall@k, as shown in Eq. (10).

$\displaystyle\textit{F1@k}=2\times\frac{\textit{Precision@k}\times\textit{Recall@k}}{\textit{Precision@k}+\textit{Recall@k}}$ (10)

The range of values for F1@k is [0,1], with a higher value being preferable.
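
Eqs (8)-(10) can be computed from a ranked result list and a set of relevant documents as in the following sketch. The function names and the toy data are chosen for illustration, and the sketch assumes that at least $k$ results are returned.

```python
def precision_at_k(ranked, relevant, k):
    """Eq. (8): share of the top-k results that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Eq. (9): share of all relevant documents found in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def f1_at_k(ranked, relevant, k):
    """Eq. (10): harmonic mean of Precision@k and Recall@k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0

ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6", "d7"}
print(precision_at_k(ranked, relevant, 5),
      recall_at_k(ranked, relevant, 5),
      f1_at_k(ranked, relevant, 5))
```

In this toy case two of the five returned documents are relevant and two of the four relevant documents are found, so Precision@5 is 0.4 and Recall@5 is 0.5.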

The results of the comparison of the F1@K values of each system are shown in Figs 15–18.

Figure 15.

Comparison of systems on $F_{1}$ @5 metrics.

Figure 16.

Comparison of systems on $F_{1}$ @10 metrics.

Figure 17.

Comparison of systems on $F_{1}$ @15 metrics.

Figure 18.

Comparison of systems on $F_{1}$ @20 metrics.

The F1 value of each system exceeds 0.8, as can be seen in Figs 15–18. The F1 values of MathWebSearch, MCAT, TUW, RIT, and MFIRS systems are higher and similar, and the F1 value of the MFIRS system is the highest. The KWARC, MIRMU, LeActiveMath, LatexSearch, Egothor, EgoMath, MathFind, and MathDex systems had relatively low F1 scores, and the MathDex system had the lowest F1 score.

There is also a variant of F1, as shown in Eq. (11):

$\displaystyle F_{\beta}=\frac{\left({1+\beta^{2}}\right)\times\left({\textit{precision}\times\textit{recall}}\right)}{\left({\beta^{2}\times\textit{precision}+\textit{recall}}\right)}$ (11)

$F_{0.5}$ weights recall half as much as precision, $F_{2}$ weights recall twice as much as precision, and $F_{1}$ weights precision and recall equally. $F_{0.5}$ and $F_{2}$ are two common settings for the F value.
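
A short numerical check of Eq. (11) shows how the choice of $\beta$ shifts the score; the precision and recall values below are made up for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Eq. (11): weighted harmonic mean of precision and recall."""
    denominator = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0

# A system with high precision but low recall (illustrative values).
for beta in (0.5, 1.0, 2.0):
    print(beta, f_beta(1.0, 0.5, beta))
```

For a precision of 1.0 and a recall of 0.5, the score drops from about 0.83 at $\beta=0.5$ to about 0.56 at $\beta=2$, because larger values of $\beta$ place more weight on the weaker recall.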

The PR curve is often used in the field of information extraction, and it can replace the ROC curve when the class distribution in the dataset is imbalanced. The horizontal and vertical axes of the PR curve represent recall and precision, respectively. Precision is the ratio of the number of true positive samples to the number of samples predicted as positive.

Figure 19.

Comparison of the average accuracy of each system.

The ROC curve is similar to the PR curve. When comparing multiple models, the better-performing model lies closer to the upper left corner of the ROC plot, or closer to the upper right corner of the PR plot.

Figure 20.

ROC graph of each system.

The average precision of each system is shown in Fig. 19. It can be seen from Fig. 19 that the average precision of the MathWebSearch, RIT, TUW, MCAT, KWARC, MIRMU, LeActiveMath, LatexSearch, and EgoMath systems decreases in that order. The MathDex system has the lowest average precision, and MathFind has almost the same average precision as the Egothor system. The average precision of the MFIRS system is the highest, and the corresponding area under the PR curve is the largest.

As can be seen from Fig. 20, the ROC curve of MFIRS almost completely encloses the ROC curves of the other systems, so MFIRS has the best performance. The other systems rank from highest to lowest as MathWebSearch, RIT, TUW, MCAT, KWARC, MIRMU, LeActiveMath, and LatexSearch. The curves of EgoMath and MathFind converge at FPR $=$ 0.6, indicating that the performance of these two systems is almost the same, while the ROC curve of MathDex is at the bottom.

MRR (Mean Reciprocal Rank) is a widely used measure for evaluating search algorithms. It averages the reciprocal ranks of the first correct match over multiple queries: a first match at rank 1 scores 1, at rank 2 scores 0.5, and at rank $n$ scores $1/n$. If there is no match, the score is 0, and the final score is the mean of the scores of all queries. MRR therefore depends only on the first relevant result of each query. MRR is calculated as in Eq. (12).

$\displaystyle\textit{MRR}=\frac{1}{\left|Q\right|}\mathop{\sum}\nolimits_{i=1}^{\left|Q\right|}\frac{1}{\textit{rank}_{i}}$ (12)

where $\left|Q\right|$ is the number of queries and $\textit{rank}_{i}$ represents the rank of the first correct match for the $i$-th query.
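
Eq. (12) translates directly into code. In this sketch a query without any correct match is represented by `None` and contributes a score of 0; that representation is an assumption of the example.

```python
def mean_reciprocal_rank(first_ranks):
    """Eq. (12): mean of 1/rank over all queries; None means no correct match."""
    scores = [1.0 / rank if rank else 0.0 for rank in first_ranks]
    return sum(scores) / len(scores)

# First correct match at ranks 1, 2 and 4; the third query has no match.
print(mean_reciprocal_rank([1, 2, None, 4]))
```

The four queries contribute 1, 0.5, 0, and 0.25, giving an MRR of 0.4375.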

The Binary Preference (Bpref) measure ignores unjudged matches and quantifies the ability of the ranking method to rank judged-relevant matches higher than judged-irrelevant ones. For a given query with $R$ judged-relevant documents and $N$ judged-irrelevant documents, we define NonRelevant($r$) as the number of judged-irrelevant documents ranked above the relevant match $r$. Bpref is then defined by Eq. (13).

$\displaystyle\textit{Bpref}=\frac{1}{R}\mathop{\sum}\nolimits_{r}\left[1-\frac{\min\left(\textit{NonRelevant}(r),R\right)}{\min\left(N,R\right)}\right]$ (13)
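
The following sketch evaluates Eq. (13) for one query. It assumes that NonRelevant($r$) counts the judged-irrelevant documents ranked above the relevant document $r$, and that documents in neither judgment set are unjudged and skipped; the function name and toy data are illustrative.

```python
def bpref(ranked, relevant, nonrelevant):
    """Eq. (13) for one query; documents in neither set are unjudged and ignored."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    bound = min(N, R)
    nonrelevant_above = 0
    total = 0.0
    for doc in ranked:
        if doc in nonrelevant:
            nonrelevant_above += 1
        elif doc in relevant:
            penalty = min(nonrelevant_above, R) / bound if bound else 0.0
            total += 1.0 - penalty
    return total / R

ranked = ["r1", "n1", "r2", "u1", "n2", "r3"]  # u1 is an unjudged document
print(bpref(ranked, {"r1", "r2", "r3"}, {"n1", "n2"}))
```

Here the three relevant documents contribute 1, 0.5, and 0, so Bpref is 0.5; the unjudged document u1 does not change the result.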

Precision and recall each measure only one aspect of retrieval performance, and ideally both are relatively high. Improving recall generally affects precision, so precision can be regarded as a function of recall, namely $P=f(R)$ , where $P$ is the precision and $R$ is the recall. As the recall increases from 0 to 1, the precision changes, and the function $P=f(R)$ can be integrated over $R$ . The resulting average precision (AveP) is given by Eq. (14):

$\displaystyle\textit{AveP}=\int_{0}^{1}P\left(r\right)dr=\sum_{k=1}^{n}P\left(k\right)\Delta r\left(k\right)=\frac{\sum_{k=1}^{n}\left(P\left(k\right)\times\textit{rel}\left(k\right)\right)}{\textit{number of relevant documents}}$ (14)

where $n$ is the number of retrieved documents, $\Delta r(k)$ is the change in recall from rank $k-1$ to rank $k$, rel($k$) is 1 if the $k$-th document is relevant and 0 otherwise, and $P(k)$ is the precision of the first $k$ result documents. Equivalently, AveP can be calculated by Eq. (15):

$\displaystyle\textit{AveP}=\frac{1}{R}\times\sum_{r=1}^{R}\frac{r}{\textit{position}\left(r\right)}$ (15)

Here position($r$) denotes the position, counted from the top of the result list, of the $r$-th relevant document, and $R$ is the total number of relevant documents.

AveP averages the precision measured at each rank where a relevant document appears, as the recall increases step by step from 0 to 1. A ranking that keeps precision high at each of these recall levels therefore obtains a relatively large AveP.
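The two forms of AveP in Eq. (14) and Eq. (15) give the same value when every relevant document appears in the result list. The sketch below checks this on a small hypothetical ranking; the function names and the sample data are illustrative assumptions only.

```python
from typing import Sequence


def average_precision_eq14(rel: Sequence[int], num_relevant: int) -> float:
    """Eq. (14): sum of P(k) * rel(k), divided by the number of relevant documents."""
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(rel, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # P(k) evaluated at a relevant position
    return total / num_relevant


def average_precision_eq15(positions: Sequence[int], num_relevant: int) -> float:
    """Eq. (15): (1/R) * sum over r of r / position(r)."""
    if num_relevant == 0:
        return 0.0
    ordered = sorted(positions)
    return sum(r / pos for r, pos in enumerate(ordered, start=1)) / num_relevant


# Hypothetical result list: relevant documents at ranks 1, 3 and 6.
rel = [1, 0, 1, 0, 0, 1]
positions = [1, 3, 6]
ap14 = average_precision_eq14(rel, num_relevant=3)
ap15 = average_precision_eq15(positions, num_relevant=3)
print(ap14, ap15)  # both equal (1/1 + 2/3 + 3/6) / 3
```

If a relevant document is missing from the result list, it simply contributes nothing to the sum while still counting in the denominator, which lowers AveP.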

A retrieval system is usually evaluated with multiple queries, so the AveP values of the individual queries are averaged. The resulting Mean Average Precision (MAP) is calculated by Eq. (16).

$\displaystyle\textit{MAP}=\frac{\sum_{q=1}^{Q}\textit{AveP}\left(q\right)}{Q}$ (16)

The average precision for a single topic is obtained by averaging the precision values at the ranks of the documents relevant to that topic.

Figure 21.

Comparison of systems on the mean reciprocal rank (MRR) metric.

Figure 22.

Comparison of systems on Bpref metrics.

Figure 23.

Comparison of systems on MAP metrics.

MAP is the mean of the average precision values of the individual topics over the whole topic set. It is a single-value index that measures the system’s retrieval performance across all relevant documents. If the system returns no relevant documents for a query, the precision for that query is zero. The higher the relevant documents are ranked, the greater the MAP value.

The relevance model underlying MAP is relatively simple: the relationship between a query ($q$) and a retrieved document ($d$) is binary, either 0 or 1. The core idea is to evaluate the sorting algorithm by the positions of the relevant documents for each query. Computing MAP therefore requires knowing, for each query, how many documents are relevant and at which positions they appear in the sorted result.
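The two inputs named above, the number of relevant items per query and their positions in the sorted result, are sufficient to compute MAP. The following sketch combines Eq. (15) and Eq. (16) under that assumption; the data layout (a list of `(positions, num_relevant)` pairs) and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def average_precision(positions: Sequence[int], num_relevant: int) -> float:
    """Eq. (15): AveP from the 1-based positions of the retrieved relevant items."""
    if num_relevant == 0:
        return 0.0
    ordered = sorted(positions)
    return sum(r / pos for r, pos in enumerate(ordered, start=1)) / num_relevant


def mean_average_precision(queries: Sequence[Tuple[Sequence[int], int]]) -> float:
    """Eq. (16): mean of AveP over all queries."""
    if not queries:
        raise ValueError("at least one query is required")
    return sum(average_precision(pos, n) for pos, n in queries) / len(queries)


# Three hypothetical queries: (positions of relevant items, number of relevant items).
queries = [
    ([1, 3], 2),  # AveP = (1/1 + 2/3) / 2
    ([2], 1),     # AveP = (1/2) / 1
    ([], 2),      # nothing relevant returned, AveP = 0
]
map_score = mean_average_precision(queries)
print(map_score)
```

The third query shows the case in which no relevant item is returned: its AveP is zero and it lowers the mean.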

The index values of MRR, Bpref, and MAP for each system are shown in Figs 21–23, respectively.

As can be seen in Figs 21–23, MFIRS reaches the highest value on the MRR, Bpref, and MAP indicators. MathWebSearch, TUW, RIT, and MCAT rank second, third, fourth, and fifth, respectively, on these three indicators. KWARC, MIRMU, LeActiveMath, and LatexSearch rank ahead of the Egothor, EgoMath, MathFind, and MathDex systems.

MAP scores documents on a binary scale, as either relevant or irrelevant. When documents are judged on multiple levels of relevance, the nDCG index can be used instead.

The cumulative gain (CG) over the first $p$ results is calculated by Eq. (17):

$\displaystyle\textit{CG}_{p}=\mathop{\sum}\nolimits_{i=1}^{p}\textit{rel}_{i}$ (17)

The relevance level of the $i$-th document is denoted by $\textit{rel}_{i}$ . The larger the $\textit{rel}_{i}$ value, the higher the relevance; a value of zero means that the document is not relevant.

Because $\textit{CG}_{p}$ is insensitive to positional information, it cannot distinguish between different orderings of the same documents. For example, suppose three documents are retrieved with relevance levels $\{3,-1,1\}$ in one ranking and $\{-1,1,3\}$ in another; the first ordering is clearly better, but both have the same CG. A position-aware metric must consider both the relevance level of a document and the position at which it appears. Assume that each position carries a value and that these values decrease from the first position to the last. For example, if the value of the $i$-th position is $1/\text{log}_{2}\left({i+1}\right)$ , then the gain contributed by the document in the $i$-th position is given by Eq. (18):

$\displaystyle\textit{rel}_{i}\times\frac{1}{\text{log}_{2}\left({i+1}\right)}=\frac{\textit{rel}_{i}}{\text{log}_{2}\left({i+1}\right)}$ (18)

DCG (Discounted Cumulative Gain) is calculated by Eq. (19):

$\displaystyle\textit{DCG}_{p}=\sum_{i=1}^{p}\frac{\textit{rel}_{i}}{\text{log}_{2}\left({i+1}\right)}=\textit{rel}_{1}+\sum_{i=2}^{p}\frac{\textit{rel}_{i}}{\text{log}_{2}\left({i+1}\right)}$ (19)
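To illustrate why the position discount matters, the following sketch applies Eq. (17) and Eq. (19) to the two orderings from the example above. The function names are hypothetical; the relevance values are the ones used in the text.

```python
import math
from typing import Sequence


def cumulative_gain(rels: Sequence[float]) -> float:
    """Eq. (17): plain sum of the relevance levels, ignoring position."""
    return sum(rels)


def dcg(rels: Sequence[float]) -> float:
    """Eq. (19): each relevance level is discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))


first_ordering = [3, -1, 1]
second_ordering = [-1, 1, 3]

cg_first = cumulative_gain(first_ordering)
cg_second = cumulative_gain(second_ordering)
dcg_first = dcg(first_ordering)
dcg_second = dcg(second_ordering)

print(cg_first, cg_second)    # identical cumulative gain
print(dcg_first, dcg_second)  # the first ordering scores higher
```

Both orderings have a cumulative gain of 3, but the first ordering obtains the larger DCG because its most relevant document sits at the top, where the discount $\text{log}_{2}(1+1)$ equals 1.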

Another commonly used form of DCG, Eq. (20), increases the influence of highly relevant documents:

$\displaystyle\textit{DCG}_{p}=\sum_{i=1}^{p}\frac{2^{\textit{rel}_{i}}-1}{\text{log}_{2}\left({i+1}\right)}$ (20)

IDCG (Ideal DCG) is the maximum DCG attainable for a given query and cutoff $p$; it is calculated using Eq. (21):

$\displaystyle\textit{IDCG}_{p}=\sum_{i=1}^{\left|{\textit{REL}_{p}}\right|}\frac{2^{\textit{rel}_{i}}-1}{\text{log}_{2}\left({i+1}\right)}$ (21)

where $\textit{REL}_{p}$ is the set formed by sorting the documents in descending order of relevance and taking the first $p$ of them.

Figure 24.

Comparison of systems on nDCG@5 metrics.

Figure 25.

Comparison of systems on nDCG@10 metrics.

Figure 26.

Comparison of systems on nDCG@15 metrics.

Figure 27.

Comparison of systems on nDCG@20 metrics.

nDCG (Normalized DCG) is needed because the length of the result set varies from query to query. DCG values are strongly affected by the cutoff $p$, so the DCG values of different queries cannot be averaged or compared directly and must first be normalized.

Eq. (22) normalizes DCG by IDCG, so that nDCG indicates how close the DCG of a ranking is to the ideal DCG.

$\displaystyle\text{nDCG}_{p}=\frac{\text{DCG}_{p}}{\text{IDCG}_{p}}$ (22)

In this way, the $\text{nDCG}_{p}$ value of each query statement ranges from 0 to 1, allowing different query statements to be compared, and the average $\text{nDCG}_{p}$ of multiple query statements to be calculated.
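The normalization in Eq. (22) can be sketched as follows, using the exponential gain of Eq. (20) and Eq. (21). This is a minimal illustration rather than the evaluation code of the experiments: the function names and relevance values are hypothetical, and the ideal ordering is assumed to be obtainable by re-sorting the same list, that is, the list is assumed to contain every judged document for the query.

```python
import math
from typing import Sequence


def dcg_at(rels: Sequence[int], p: int) -> float:
    """Eq. (20): exponential-gain DCG over the first p results."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:p], start=1))


def ndcg_at(rels: Sequence[int], p: int) -> float:
    """Eq. (22): DCG_p divided by IDCG_p (Eq. (21))."""
    ideal = sorted(rels, reverse=True)  # most relevant documents first
    idcg = dcg_at(ideal, p)
    return dcg_at(rels, p) / idcg if idcg > 0 else 0.0


# Hypothetical graded judgments (0 = not relevant, 3 = highly relevant).
query_a = [3, 2, 0, 1]
query_b = [3, 2, 1, 0]  # already in ideal order

ndcg_a = ndcg_at(query_a, p=3)
ndcg_b = ndcg_at(query_b, p=3)
mean_ndcg = (ndcg_a + ndcg_b) / 2  # comparable because both lie in [0, 1]
print(ndcg_a, ndcg_b, mean_ndcg)
```

Because each per-query value is scaled by its own ideal ranking, the values of queries with different numbers of relevant documents can be averaged.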

nDCG@k is the normalized discounted cumulative gain computed over the top $k$ results, an evaluation index that takes the order of the returned results into account. Its values lie in [0,1]; the higher the value, the better the ranking. For example, nDCG@10 and nDCG@20 denote nDCG with $p=10$ and $p=20$, respectively.

The nDCG@k indicator values of each system are shown in Figs 24–27.

As can be seen in Figs 24–27, the nDCG@k values of the KWARC, MIRMU, LeActiveMath, LatexSearch, Egothor, EgoMath, MathFind, and MathDex systems are all lower than those of the MathWebSearch, MCAT, TUW, RIT, and MFIRS systems.

As shown in Figs 24–27, the MFIRS system has the highest nDCG@k value, and the MathDex system has the lowest.

4. Conclusion

A method for mathematical retrieval, indexing, and sorting is proposed and integrated into the implemented mathematical formula information retrieval system. The feasibility of this method has been verified on the MathRetEval dataset, which also served as the basis for testing the scalability of the system.

There are plans to use this technology in digital library projects around the world. The system’s web front end is fully functional and can convert mathematics into Presentation MathML for full-text retrieval. The system scales well and is capable of being used in large digital libraries.

In the future, the system will be able to retrieve mathematical formulas written not only in Presentation MathML but also in Content MathML.

Acknowledgments

The authors acknowledge the Anhui Province Excellent Talent Training Project (Research and Application of Multi-instance Learning Algorithm, gxyq2018107), the Key Project of Anhui Provincial Department of Education (KJ2020A0744), and the Bengbu University High-level Talent Startup Project (BBXY2018KYQD07).
