Abstract
Scientific documents, which are majorly constituted of math formulae, form a primary source of scientific and technical information. However, the indexing and the search processes of conventional search engines barely account for mathematical contents of such documents. Though the recent past has witnessed a surge in number of Mathematical Information Retrieval (MIR) systems intending to retrieve math formulae from scientific documents, the low values of their evaluation measures are indicative of the scope for improvement. To cope with the challenges of MIR, and to further the performance of state-of-the-art systems, a novel approach, called Binary Vector Transformation of Math Formula (BVTMF), is introduced. The implemented system extracts MathML formulae from the documents, preprocesses them, and renders them into fairly large-sized binary vectors (vectors of ‘0’s and ‘1’s). Generated formula vector is representative of the information content of corresponding formula. For indexing and searching text contents, the system relies on Apache Lucene. Text and math search results retrieved by independent text and math sub-systems are re-ranked to prioritize the results containing text as well as math components of the user query. Quality of the retrieved search results and appreciable values of the evaluation measures substantiate competence of the proposed approach.
Keywords
Introduction
With increasing research in the fields of Science and Engineering, huge increase in the scientific documents on the web can be experienced. Such documents remarkably differ from normal text documents in terms of their contents. As opposed to normal text documents, scientific documents contain complex formula, and it can be erroneous to index and search such formulae in conventional fashion. Though retrieval systems exists for such documents, fairly modest values of evaluation measures are indicative of the room for improvement.
Following are some of the key challenges of Mathematical Information Retrieval (MIR): First, MathML and TEX count among popular notations used for writing math formulae. However, the contemporary search engines lack specifically designed math editors, which are obligatory for inputting math formulae in these two standard notations. Second, in response to the query (say,
Above-mentioned list of challenges is by no means exhaustive, but it suffices to convey the underlying intricacies. However, over time there have been encouraging signs of continued development, and the domain of MIR has witnessed remarkable growth in the form of a number of sophisticated systems and techniques. In particular, the math aware systems, specifically designed and adapted for the math tasks [1, 19] of series of NII Testbeds and Community for Information access Research (NTCIR) conferences [6, 19], have contributed significantly to the domain. While some MIR systems transform math formulae to text-like representations, the others exploit their raw representation for indexing and searching [9, 13]. For instance, the Math Information Retrieval at Masaryk University (MIRMU) team at NTCIR-10 uses conventional full text searching for searching text as well as math [10]. In order to adapt math formulae to full text searching methods, the formulae are preprocessed. During preprocessing, formulae are canonicalized (to enable search for differently expressed formulae), tokenized (to enable search for sub-formulae) and structurally unified (to enable search for similar formulae). The resultant formulae are transformed to compacted string forms, which can be handled by text-based search engines. Furthermore, the keyword based approach of the BRKLY team at NTCIR-10 transforms XML markup into searchable text and uses conventional full text indexing for capturing variable names present in the formulae [9]. However, abject failure of the system is indicative of the conceptual difference between formula search and classical text-based search. In contrast, the MathWebSearch (MWS) system of KWARC team exploits raw formulae and generates harvest, which comprises of formula in content MathML and text of the document [5]. Eventually, the MWS Formula Indexer performs depth first traversal of Content MathML tree to encode and index the mathematical expressions.
User query in MIR may also contain keywords alongside formula. Keywords often specify the context in which formula is to be searched [8]. An MIR system should include support for matching keywords and formula query against the indexed textual and mathematical contents, respectively. Additionally, matching of keyword (say, Pythagoras’ theorem) against indexed mathematical contents (say, a2 + b2 = c2) and matching of formula query (say, E = mc2) against indexed textual contents (say, “mass–energy equivalence”) should also be supported. To this end, a retrieval system primarily constituted of Text-Text Entailment (TE), Math-Math Entailment (ME) and Text-Math Entailment (TME) modules is proposed in [11]. The three modules (TE, ME and TME) handle text-text, math-math and text-math matchings, respectively.
Although existing systems have addressed majority of the challenges, fairly modest values of their evaluation measures are indicative of the fact that there is ample room for improvement. Thus, it could be interesting to prospect other competent techniques, which could further improve the performance of state-of-the-art MIR systems. To our knowledge, the vector transformation of math formula for MIR is a relatively unexplored research topic – a fact that has motivated us to delve deeper into the topic. A preliminary approach to math formulae embedding uses symbol vectors and Distributed Memory Model of Paragraph Vectors (PV-DM) for embedding formulae [4]. Symbol vectors for formula symbols are obtained using symbol2vec method. A 100 dimensional embedding is generated for each 892 LATEX formula symbols derived from 1,94,150 formula. Thereafter, cosine similarity of vectors is exploited to generate clusters of similar symbols. A document retrieval system [17] for NTCIR-12 MathIR task uses Bag of Words version of Paragraph Vector (PV-DBOW), a form of doc2vec algorithm, to transform 2-d math expression trees into real valued vectors.
In this paper, a novel Binary Vector Transformation of Math Formula (BVTMF) approach to MIR is described. The implemented system captures information content and uses a Bit Position Information Table (BPIT) to generate fairly large-sized binary vectors for the formulae inside documents and the mathematical user query. The extent of similarity between indexed formulae vector and query vector quantifies relevance of search result. Also, instead of reinventing the wheel for text retrieval, the implemented system seeks support of Apache Lucene 1 to index and search text contents. Search results retrieved by the two standalone text and math sub-components are re-ranked in order to prioritize the search results containing text as well as math. Effectiveness of the system has been tested using Wikipedia corpus, and Wikipedia Main Task and Browsing Task queries of NTCIR-12 MathIR task 2 .
Remainder of the paper is structured as follows: Section 2 details system architecture; Section 3 describes experimental setup; Section 4 highlights system results and their analysis; Section 5 concludes the paper and points direction for future research.
System description
This section presents corpus details and architecture of the implemented system, which uses BVTMF approach. The system is characterized by a number of components, which collaboratively interact to furnish intended search results.
Corpus description
The experimentation was carried out on 31,839 MathTagArticles of Wikipedia corpus of NTCIR-12 MathIR task. Scientific texts and a total of 579,608 formulae, encoded using Presentation MathML, Content MathML and TEX, constitute contents of these HTML articles. However, to generate the binary vectors, the BVTMF approach exploits only Presentation MathML encoding and discards the other two encodings. Table 1 summarizes the corpus details.
Corpus Description
Corpus Description
The system architecture, shown in Fig. 1, consists of following major subsystems: Document Handler Query Handler Formula Preprocessor Binary Vector Transformation Module Math Indexer Text Content Indexer Searcher Text-Text Entailment (TE) Module Math-Math Entailment (ME) Module Text-Math Entailment (TME) Module Ranker. Implemented system using Binary Vector Transformation of Math Formula (BVTMF) approach. TQ: Text Query; MQ: Math Query; TR: Text Result; MR: Math Result.

The scientific documents and the user query may contain text as well as math. Therefore, the Document Handler and the Query Handler bifurcate content of documents and queries, respectively, into text and math. Formula Preprocessor removes redundant MathML elements and attributes from document formulae and mathematical user query. Moreover, preprocessing preserves only the Presentation MathML encoding of the formulae. Table 2 shows some examples of preprocessing done by Formula Preprocessor. Original and preprocessed formulae are shown in Fig. 2(a) and 2(b), respectively.
Some examples of preprocessing done by Formula Preprocessor

(a) Original Formula (b) Preprocessed Formula.
After having preprocessed the math formula (document formula or mathematical query), the Binary Vector Transformation Module extracts information content of the formula and uses a Bit Position Information Table (BPIT) to generate binary vector of size 202. BPIT, shown in Figure 3, assigns bit position to several mathematical entities and their positional information (superscript, subscript, etc.). Hence, the BPIT accounts for the entities as well as the syntax of the formula. More specifically, each bit of the binary vector designates either a primary lexical entity (variable or mathematical operator) or special lexical entity (such as lim, log, exp, and so on) or MathML tag or position (superscript, subscript, etc.) of lexical entity in the formula. For each formula, the bits corresponding to its constituent and their syntax information are set to 1, and all other remaining bits are set to 0. For example, consider the math formula y = ex+z. The information content of the formula are undermentioned: Formula comprises of 3 variables, namely ‘x’, ‘y’ and ‘z’. Formula comprises of special lexical entity, namely the exponent (e). Formula comprises of 2 operators, namely ‘=’ and ‘+’. Entities ‘x’, ‘z’ and ‘+’ are present as superscript of the exponent (e). Snapshot of Bit Position Information Table (BPIT). Only some of the entities and their corresponding bit positions are shown. Bit positions 150-201 (not shown in this figure) account for positional information of the entities, i.e. the syntax of formula.

Using above-mentioned information content and BPIT, the Binary Vector Transformation Module generates binary vector for the formula y = ex+z. Fig. 5 exemplifies the vector generation process.
Following key points are worth noting with regard to the BPIT and the vector generation process: Bit positions 0–25, 57–65 and 71–100 correspond to the content of 〈mi〉 tag, which are variables or special lexical entities. Also, the bit positions 26–45, 66–70 and 101–149 correspond to the contents of 〈mo〉 tag, which are mathematical operators. Bit positions 46-56 refer to MathML tags, such as 〈mfrac〉, 〈msqrt〉, etc., which contribute to the semantics of formula. Bit positions 150–175 and 176–201 account for presence of lexical entities in superscript and subscript, respectively. Thus, a total of 65 bit positions are reserved for variable names and special lexical entities, whereas the remaining 137 bit positions are reserved for mathematical operators, MathML tags and positional information of the lexical entities. Semantically similar entities are assigned same bit position. For instance, ‘exp’ and ‘e’ are assigned same bit position equal to 4, and ‘log’ and ‘ln’ are assigned same bit position equal to 88. Moreover, the upper and the lower cases of an entity are also considered equivalent. Unlike the normal variable ‘e’ in the formula “e + f” (for example), the exponential variable ‘e’ is often accompanied by another entity raised to its power. This induces setting of an additional corresponding superscript bit equal to 1. Hence, as the formula vector accounts for entities in the formula as well as the syntax of formula, the difference between the normal variable ‘e’ and the exponential variable ‘e’ is reflected in their respective formula vectors. Multiple occurrences of an entity are not accounted. Only one corresponding bit of the formula vector is set to 1, even if the formula contains multiple occurrences of a given entity. The list of entities is by no means exhaustive, but it accounts for majority of mathematical entities and their positional information. Extending size of the formula vector to account for a comprehensible list of entities and their syntax can be a possible future direction.

Calculation of similarity between an indexed formula vector and a query formula vector. In the two vectors, the blue highlighted bits refer to x, whereas the red highlighted bits refer to y. Also, x=2 and y=3.
Math Indexer indexes a total of 5,79,608 formulae, derived from 31,839 documents. Math Index (of size 787.6 MB), generated by the indexer, stores following information: (i) Preprocessed formula in Presentation MathML notation (ii) Binary vector generated by the Binary Vector Transformation Module (iii) Document which contains the formula.
Implemented system uses the indexer and the searcher modules of Apache Lucene for indexing and searching text contents. Apache Lucene based Text Content Indexer generates full inverted text index for the text contents of the documents.
Searcher module seeks support of ME and TE modules to retrieve math and text search results, respectively. The two modules isolatedly search math and text components of the user query in math and text indices, respectively.
The ME module computes similarity between the indexed formula vector (say, f) and the mathematical query vector (say, q) using Equation (1).
x: matching bit positions in f and q, which are set to 1, and which correspond to the content of 〈mi〉 tag (i.e., variables and special lexical entities).
count(x): count of x
y: matching bit positions in f and q, which are set to 1, and which correspond to the bit positions other than the ones referred by x.
count(y): count of y
k: factor by which y is preferred over x.
max_score: maximum similarity score.
Following points are worth noting about the similarity formula: Justification for using such similarity formula is as follows: Matching set bit positions designate presence of similar entities in the two formulae or similar structure of formulae. Hence, a count of such matching set bit positions quantifies extent of similarity. Also, for comparison of two potential search results, normalization of the similarity score in the range of 0 to 1 (through division by maximum similarity score) is necessary. While computing similarity, y (i.e. the bit positions referring to operators, positional information of the entities and MathML tags such as 〈mfrac〉, 〈msqrt〉, etc.) is preferred over x (i.e. bit positions referring to variable names and special lexical entities). Value of the variable ‘k’ governs the extent of preference. Assigning less preference to variable names ensures retrieval of semantically similar search results. In our experiment, the value of k has been set equal to 4. It should be noted that the value of k = 1 implies that all the bit positions attain equal preference. As mentioned previously, count (x) can attain a maximum value of 65, whereas count (y) can attain a maximum value of 137. Thus, max _ score equals 613 (65+4*137=613). Division by the max _ score yields normalized similarity score in the range 0 to 1. Figure 4 shows the similarity calculated between an indexed formula vector and a query formula vector. Owing to the fact that different bit positions have different weightage, the classical similarity measures, such as Cosine or Euclidean, have not been used to compute similarity between the formulae. Similarity formula satisfies symmetric property of equality, meaning thereby that the similarity between f and q is same as the similarity between q and f (i.e. Similarity (f, q)=Similarity (q, f)). This is attributed to the underlying nature of similarity computation, wherein only the count of matching set bit positions in the two formulae is important, and the order of formulae is trivial. It should be noted that Similarity (f, f) equals 1 if and only if all the bits of formula vector of f are set equal to 1. To put it a different way, the formula f should contain all the entities and structural constructs of BPIT for Similarity (f, f) =1. Also, Similarity (f, f) will be more than Similarity (f, q), for any formula query q different than f. From the aforesaid discussion, it can be deduced that the Similarity (f, q) will be close to 1 if (i) the count of matching set bit positions of f and q formula vectors is high, and (ii) majority of the bits in f and q formula vectors are set equal to ‘1’. As the formulae vectors of f and q in Fig. 4 do not comply with the two criteria, the similarity score is less.
The ME module returns a list of 100 top-ranked math search results, which are ranked based on the similarity score.
In a similar manner, the TE module exploits Text Index to look at the relatedness of text query and the indexed textual terms. The module returns a list of 100 top-ranked text search results to the Searcher.
The job of TME module is to check for relatedness between text query and indexed formulae vector, and between mathematical query vector and indexed textual terms. However, as of now, the TME module is not implemented.
Since text and math contents are indexed in isolation, the text and the math search results may share commonness. Given the fact that common search results will contain text as well as math, the Ranker module ranks them higher than the search results containing only math or only text. More specifically, the search results are ranked in the following order:
(i) search results containing text as well as math (if any) (ii) search results containing only math (if any) (iii) search results containing only text (if any).

Formula vector generation process for the formula y = ex+z.
This section details Queryset and Gold Dataset used for experimentation, and evaluation parameters used for system evaluation.
Queryset description
The queryset comprises of 10 MathML queries of Wikipedia Task 3 of NTCIR-12 MathIR task (nine from Wikipedia Main Task and one from Wikipedia Formula Browsing Task). Majority of queries contain MathML formula as well as text keywords. Similar to the documents, the queries are encoded using Presentation MathML, Content MathML and TEX. However, out of the three encodings, the system is experimented using Presentation MathML encoding of queries.
Gold dataset description
The format of Gold Dataset (also called qrel file), provided by the organizers of NTCIR-12 MathIR task, adheres to the Text REtrieval Conference (TREC) 4 qrel format [18]. The Gold Dataset contains following fields: (i) QueryID (ii) Iteration (iii) Document# (iv) Relevance. For each query (specified by QueryID) in the queryset, the Gold Dataset contains a set of human assessed documents (specified by Document#), which are either judged relevant (Relevance = 1) or judged irrelevant (Relevance = 0). Iteration field is set to 0 and ignored by the trec_eval 5 evaluation tool.
Evaluation parameters
System generated results are evaluated on the grounds of Precision_k (P _ k) (k=5, 10, 15, 20) evaluation measures. P _ k for an individual query is measure of the number of relevant documents out of the first k retrieved documents. Individual P _ k measures are averaged over all the queries to obtain the final P _ k scores.
Results and analysis
Format of result set
Similar to the Gold Dataset, the format of Result Set adheres to the TREC format. Results Set is characterized by following fields: (i) QueryID (ii) Iteration (iii) Document# (iv) Rank (v) Similarity (vi) Run_ID. Similarity field takes floating point real values and indicates extent of similarity between the query (specified by QueryID) and the retrieved search result (specified by Document#). In our experiment, similarity score, discussed in Subsection 2.2, is used in the Similarity field. The Run_ID field highlights system run and gets printed on the evaluation report. Among the six fields, only QueryID, Document# and Similarity are used for evaluation, and the remaining three trivial fields are ignored by the evaluation tool.
System results
Labeled bar graph, shown in Fig. 6, shows values of evaluation parameters for the implemented system. The bar graph depicts performance of the system for original queries containing math as well as keywords. Also, the queries are shown in Table 3. Performance of the system is compared against a text-based search engine implemented using Apache Nutch 6 platform. Apache Nutch uses Apache Lucene as its indexing core. The same experimental framework (Corpus, Gold Dataset and Query Set) has been used to retrieve results from the conventional text-based search engine. Labeled bar graph, shown in Fig. 7, depicts values of evaluation parameters for the two systems.
Top-ranked search results (documents and their excerpts) for some of the queries
Top-ranked search results (documents and their excerpts) for some of the queries

Values of evaluation parameters for the implemented system queried using original NTCIR queries.

Performance comparison of the implemented BVTMF based system and a text-based search system.
The system results will be looked at, in more detail, in this subsection. Following analysis can be made for the system generated results: Reasonable values of evaluation parameters for the original queries (text+math) prove competence of the implemented system in retrieving mathematical as well as textual contents. However, inability to secure fairly high score, close to 1, can be attributed to the inefficiency of ranking method, and the failure of Binary Vector Transformation Module in capturing all information content of the formulae. Currently, the search results containing information content (lexical entities and their positional information) of the user query qualify among top-ranked results. This need not necessarily be always correct. For instance, given the user query, “F = ma”, search result “x = yz” will be more relevant than “F = a
m
”, even though the latter contains more information content of the given user query. Given a user query, system flawlessly retrieves all search results wherein the query term is either present in its exact form or present as part of a larger parent formula (see results 6 and 7 of Table 3). This happens because the similarity score for such results will be maximum. The system successfully retrieves sub-formulae and similar formulae search results for a given user query (see results 8 and 13 of Table 3). The system efficiently handles user query containing keywords as well as math (see results 10 and 11 of Table 3). To emphasize the importance of context (in form of keywords) of formula, the keywords from original NTCIR queries are removed. For example, the original NTCIR query “Legendre ( Considerable difference in the values of corresponding evaluation measures are indicative of the fact that the implemented system significantly outperforms a conventional text-based search system (see Figure 7). This is attributed to the fact that math search is way different than conventional text search in terms of retrieval needs of the end users, and the manner in which the query term is matched to the indexed terms. Values of evaluation parameters for the implemented system queried using modified NTCIR queries.

Work presented in this paper ascertains role of binary vector transformation of math formula in Mathematical Information Retrieval (MIR). Using a Bit Position Information Table (BPIT), the Binary Vector Transformation Module of the implemented system captures and vectorizes information content of preprocessed document formula and mathematical user query. Thereafter, the extent of similarity between indexed formulae and user query guides the retrieval and ranking steps. Besides, using Apache Lucene, the text contents are indexed and searched in conventional fashion. System generated results are indicative of the capability of the system to retrieve exact formula, sub-formula and similar formula search results. Also, the values of evaluation parameters (P _ 5 = 0.64, P _ 10 = 0.54, P _ 10 = 0.52 and P _ 20 = 0.47) substantiate reasonable performance of the discussed approach in retrieving math formulae and text from scientific documents for the given queryset.
Reduction in index size and query search time, enhancement in context analyzing ability of the searcher module and exploring possibility of embedding text along with the math are few of the notable future directions, which can further improve the performance of the implemented system. Moreover, in future, the performance of system will be prospected using the complete list of queries of Wikipedia main and browsing tasks of NTCIR-12 MathIR task.
Footnotes
Acknowledgement
The work presented here falls under the Research Project Grant No. YSS/2015/000988 and supported by the Department of Science & Technology (DST) and Science and Engineering Research Board (SERB), Govt. of India. The authors would like to acknowledge the Department of Computer Science & Engineering, National Institute of Technology Mizoram, India for providing infrastructural facilities and support.
