Abstract
Cross-Lingual Information Retrieval (CLIR) enables a user to query in a language different from that of the target documents. CLIR incorporates a machine translation technique, such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT), which is trained on either a dictionary or a parallel corpus. A Hindi word may have multiple variants due to the morphological richness of the language, and these morphological variants may or may not be present in the dictionary or parallel corpus. Variants that are absent from the dictionary or parallel corpus are not translated by state-of-the-art SMT or NMT techniques. Conventional Information Retrieval (IR) eliminates stop-words to improve IR effectiveness, but the presence of some significant stop-words may improve it. This paper proposes a translation induction algorithm that incorporates a refined stop-words list and morphological variants solutions, and translates words based on their contextual words. The proposed algorithm is compared with manual dictionary, probabilistic dictionary, SMT, and NMT based translation techniques in an experimental analysis of Hindi-English CLIR, where it outperforms the other CLIR approaches.
Introduction
Nowadays, the Internet is overwhelmed by multilingual content. Global Internet usage statistics also show that the number of web accesses by non-English users, who tend to query in their regional languages, is continuously increasing. Classical Information Retrieval (IR) treats documents and sentences in other languages as unwanted noise [1]; the need to handle multiple languages therefore gives rise to a new area of IR, Cross-Lingual Information Retrieval (CLIR). CLIR provides access to relevant information in a language different from the query language [4]. CLIR can be viewed as a translation technique followed by monolingual information retrieval, where translation is applied in one of two ways: query translation, in which the queries are translated into the language of the target documents, and document translation, in which the target documents are translated into the query language. Document translation consumes a lot of computation time and space, so query translation is preferred [3]. Dictionary-Based Translation (DT), Corpus-Based Translation (CT), and Machine Translation (MT) are the conventional translation techniques [2]. DT uses either a Manual Dictionary (MD) or a Probabilistic Dictionary (PD), the latter constructed by training an IBM model on a parallel corpus. MT techniques internally use a parallel corpus, so researchers have put their efforts into developing effective and efficient MT techniques and related translation resources.
Stop-words are frequently occurring words that do not carry significant information, so they are eliminated to enhance IR effectiveness [6]; however, some stop-words do carry significant information that may improve IR effectiveness. A parallel corpus is a set of sentence-aligned bilingual texts with a limited vocabulary. A word may have many morphological variants; not all of these forms are present in the parallel corpus, although at least one of them may be. Such morphological variants are considered one type of Out Of Vocabulary (OOV) word [1, 9]. State-of-the-art Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) techniques are unable to translate such morphological variants because these techniques are not yet mature for Hindi to English translation [14].
This paper makes two contributions. (i) A Translation Induction Algorithm (TIA) is proposed which incorporates refined stop-words lists and morphological variants solutions. Significant stop-words are eliminated from the standard stop-words lists to produce new refined stop-words lists for both the source and target languages, morphological variants solutions are added to fix morphological irregularities, and contextual parallel sentences are exploited to compute the best translation. (ii) The older parallel corpus “HindiEnCorp” and the newer parallel corpus “IITBCorpus”, developed by the CFILT lab at IIT Bombay, are tested for Hindi-English CLIR, where HindiEnCorp performs better for CLIR because it is better organized than IITBCorpus. The paper is structured as follows: the literature survey is discussed in section 2, the translation induction algorithm is proposed in section 3, the experimental setup is discussed in section 4, results and discussion are presented in section 5, and section 6 presents the conclusion.
Literature survey
Direct translation approaches (DT, CT, and MT) and indirect translation approaches (Cross Lingual-Latent Semantic Indexing (CL-LSI), Cross Lingual-Latent Dirichlet Allocation (CL-LDA), and Cross Lingual-Explicit Semantic Analysis (CL-ESA)) are used for query translation in CLIR systems [23, 31]. An MD translates the words that have an exact entry in the MD, and words without an exact entry are partially matched by using approximate string matching techniques [16]. Words that are neither exactly mapped nor partially matched are considered OOV words and are translated by using transliteration generation and mining algorithms [21, 27]. These techniques cannot fix all types of morphological irregularities, such as nukta characters, infrequent words, and multiple morphological variants. In prior work, a set of parallel sentences is used to translate the query words, but the number of selected parallel sentences is large and the morphological irregularities are not fixed [25].
An open-source, language-independent machine translation toolkit, Moses, is trained on a sentence-aligned parallel corpus [10, 12], where an IBM model is used to learn a word alignment table. The Hindi and English language sentences are given as $h = \{h_1, h_2, \ldots, h_m\}$ of length $m$ and $e = \{e_1, e_2, \ldots, e_n\}$ of length $n$. An alignment function $a : j \rightarrow i$ maps an English word $e_j$ to a Hindi language word $h_i$.
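For reference, if the IBM model in question is IBM Model 1 (an assumption, since the text does not say which IBM model is used), the joint probability of an English sentence and an alignment given the Hindi sentence takes the standard form below. The lexical translation probability $t(e_j \mid h_{a(j)})$ and the normalization constant $\epsilon$ are symbols introduced here for illustration.

```latex
p(e, a \mid h) = \frac{\epsilon}{(m + 1)^{n}} \prod_{j=1}^{n} t\left(e_j \mid h_{a(j)}\right)
```

The word alignment table then corresponds to the learned values of $t(\cdot \mid \cdot)$.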
Neural networks play a significant role in the field of data mining, where they achieve strong results. NMT systems have been developed and evaluated for various foreign languages: a Recurrent Neural Network (RNN) based encoder-decoder architecture is trained on a parallel corpus to learn a conditional distribution, maximizing the conditional probability of generating a target language sentence given a source language sentence. In the RNN based encoder-decoder architecture, a source language sentence is encoded into a set of vectors, and this set of vectors is decoded into the target language sentence [4, 30]. For example, given a sentence pair $X = (x_1, x_2, \ldots, x_M)$ and $Y = (y_1, y_2, \ldots, y_N)$ of sizes $M$ and $N$ as input, the encoder encodes the sentence $X$ into a set of vectors, as given in Equation (4).
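The conditional distribution described in this paragraph can be written out as follows. This is the standard formulation for RNN encoder-decoder models and is offered only as an illustrative sketch; the encoder states $h_i$ and the recurrent function $f$ are notation assumed here and are not taken from the paper's own equations.

```latex
\begin{aligned}
h_i &= f(x_i, h_{i-1}), \qquad i = 1, \ldots, M \\
p(Y \mid X) &= \prod_{t=1}^{N} p\left(y_t \mid y_1, \ldots, y_{t-1}, X\right)
\end{aligned}
```

Training maximizes $\log p(Y \mid X)$ over the sentence pairs of the parallel corpus.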
Stop-words are frequently occurring words that do not convey significant information. Generally, stop-words are removed from the queries, but some source language stop-words have multiple meaningful target language translations (and vice versa) that may convey significant information; examples of such stop-words are given in Table 1. These significant stop-words need to be eliminated from the standard Hindi and English stop-words lists; they are listed in Table 2.
List of stop-words and their meaningful translations
List of significant stop-words for Hindi and English language
Refined Stop-Words (RSW) lists for Hindi and English are produced by eliminating the significant stop-words from the standard lists; the refined stop-words are then removed from the queries.
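The construction of a refined list can be sketched as a set difference followed by query filtering. The word lists below are hypothetical placeholders chosen for illustration (the actual entries of Tables 1 and 2 are not reproduced here), and the function names are assumptions.

```python
def refine_stop_words(standard, significant):
    """Refined list = standard stop-words minus the significant ones."""
    return set(standard) - set(significant)


def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in the stop-word list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Hypothetical lists: "us" and "it" stand in for significant stop-words
# that can also carry meaning (for example "US" and "IT").
standard_en = {"the", "of", "in", "is", "us", "it"}
significant_en = {"us", "it"}
rsw_en = refine_stop_words(standard_en, significant_en)

query = ["impact", "of", "IT", "in", "the", "US"]
with_standard = remove_stop_words(query, standard_en)
with_refined = remove_stop_words(query, rsw_en)
```

With the standard list only "impact" survives, while the refined list keeps the two meaning-bearing tokens as well.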
Query words that have an exact match in the parallel corpus are translated by a look-up technique. The Longest Common Subsequence Ratio (LCSR) string matching technique is used to translate morphological variants that are not present in their exact form. The LCSR between two strings $a$ and $b$ is computed by Eq. (6).
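Under the standard definition of LCSR, which is assumed here, the score is the length of the longest common subsequence divided by the length of the longer string:

```latex
\mathrm{LCSR}(a, b) = \frac{|\mathrm{LCS}(a, b)|}{\max(|a|, |b|)}
```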
LCS(a,b) returns the Longest Common Subsequence between the strings $a$ and $b$. In many instances, LCSR is unable to catch the morphological variants due to the morphological irregularities in the Hindi language; therefore, the following Morphological Variants Solutions (MVS) are applied to catch the approximately nearest word of the query word.
Equality of nukta character with the non-nukta character: LCSR is unable to detect the equality between the nukta and non-nukta characters, as in the words transliterated as sadak, ladai, and parvez; hence, an equality solution is applied where nukta characters are replaced by the non-nukta characters.
Auto-correction of user query words: A query word is searched in the parallel corpus as it appears; its correctness is not verified. A word popularity based auto-correction solution is applied, where a query word's frequency $wf_i$ over the parallel corpus is computed and compared to the empirically defined threshold $T$. If $wf_i$ is less than $T$, then the frequency $cwf_i$ of the nearest word (using LCSR) over the parallel corpus is computed. If $cwf_i > wf_i$, then the query word is replaced by its nearest word. Examples of such words are shown in Table 3.
Equality of chandra-bindu with the characters m and n: A query word with chandra-bindu is equivalent to many other words. For example, the word ambanee (written with chandra-bindu) has the same LCSR score of 0.83 with the three words ambanee, ambajee, and albanee; if the chandra-bindu is replaced by the character m, then the correct word ambanee is selected, which has the maximum LCSR score.
Auto-selection of the nearest query word: The LCSR score is used to select the nearest word if the word is not exactly present in the parallel corpus; such words may have multiple morphological variants with the same LCSR score, as shown in Table 4. The Compressed Word Format (CWF) algorithm [11] is used to auto-select the nearest query word; so far, the CWF algorithm has been used for transliteration mining.
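LCSR matching and two of the solutions above (nukta equality and frequency based auto-correction) can be sketched as follows. This is a minimal illustration under stated assumptions: LCSR is taken to be the LCS length divided by the longer string's length, nukta removal is done through Unicode NFD normalization, the 0.75 similarity floor inside `autocorrect` is borrowed from the LCSR threshold reported for the experiments, and all function names are hypothetical.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a, b):
    """LCS length over the longer string's length (assumed definition)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def strip_nukta(word):
    """Map nukta characters to their non-nukta counterparts.

    NFD splits precomposed nukta letters into base letter + U+093C,
    after which the nukta sign can simply be dropped.
    """
    return unicodedata.normalize("NFD", word).replace(NUKTA, "")


def autocorrect(word, freq, T=5, min_lcsr=0.75):
    """Replace a rare query word by its nearest, more frequent corpus word.

    freq maps corpus words to their frequency; T is the frequency threshold.
    min_lcsr is an assumed safeguard, not stated for this step in the text.
    """
    wf = freq.get(word, 0)
    if wf >= T:
        return word
    candidates = [w for w in freq if w != word]
    if not candidates:
        return word
    nearest = max(candidates, key=lambda w: lcsr(word, w))
    if lcsr(word, nearest) >= min_lcsr and freq[nearest] > wf:
        return nearest
    return word
```

The example uses Latin transliterations for readability; the same functions operate on Devanagari strings, where `strip_nukta` would be applied to both the query word and the corpus vocabulary before matching.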
Auto-corrected words
Multiple nearer words with same LCSR score
The proposed algorithm is presented in Algorithm 1, where the query words are searched in the parallel corpus after applying the refined stop-words list and the morphological variants solutions. The LCSR string matching technique is applied to words that have no exact match in the parallel corpus, and such words are replaced in the query by their nearest words.
A Sorted Parallel Corpus (Sorted_PC) is prepared by sorting the parallel corpus by sentence length. A set of parallel sentences is then selected for each query word $w_i$ from the parallel corpus in a contextual manner, such that each sentence contains either all three words of a trigram or both words of a bigram, independent of word order. The function N_Grams() returns the trigrams or bigrams. If the number of selected parallel sentences is less than a threshold $t$, then $z$ unigram-based parallel sentences of minimum length are also included. Term Frequency-Inverse Document Frequency (TF-IDF) indexing is applied to the selected parallel sentences, and cosine similarity scores are calculated between the query word and all target language words of the selected parallel sentences. The target language word with the maximum cosine similarity score is taken as the best translation. In the proposed algorithm, the context-based selection of parallel sentences yields more relevant translations. Target language refined stop-words are removed while performing target document retrieval.
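The selection and scoring steps can be sketched as below. This is not Algorithm 1 itself but a simplified reading of the description: each word is represented by a TF-IDF vector over the selected sentence pairs (an assumption about how the query word and target words are compared), the smoothed IDF formula is a common choice rather than one given in the text, and the names `n_grams`, `select_sentences`, `word_vectors`, and `translate` are hypothetical.

```python
import math
from collections import Counter


def n_grams(query, w):
    """Trigrams and bigrams of the query (as sets) that contain the word w."""
    grams = []
    for n in (3, 2):
        for i in range(len(query) - n + 1):
            gram = query[i:i + n]
            if w in gram:
                grams.append(set(gram))
    return grams


def select_sentences(corpus, query, w, t=10, z=70):
    """Pick sentence pairs sharing a whole n-gram with the query, in any order."""
    sorted_pc = sorted(corpus, key=lambda pair: len(pair[0]))  # shortest first
    grams = n_grams(query, w)
    chosen = [p for p in sorted_pc if any(g <= set(p[0]) for g in grams)]
    if len(chosen) < t:
        # Fall back to the z shortest sentences that contain w alone.
        extra = [p for p in sorted_pc if w in p[0] and p not in chosen]
        chosen += extra[:z]
    return chosen


def word_vectors(sentences):
    """Map each word to a sparse TF-IDF vector indexed by sentence position."""
    n = len(sentences)
    df = Counter(word for s in sentences for word in set(s))
    vectors = {}
    for i, s in enumerate(sentences):
        for word, count in Counter(s).items():
            idf = math.log((1 + n) / (1 + df[word])) + 1.0  # smoothed IDF
            vectors.setdefault(word, {})[i] = (count / len(s)) * idf
    return vectors


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def translate(corpus, query, w, t=10, z=70):
    """Return the target word whose vector is closest to that of w, or None."""
    pairs = select_sentences(corpus, query, w, t, z)
    if not pairs:
        return None
    src = word_vectors([s for s, _ in pairs])
    tgt = word_vectors([e for _, e in pairs])
    return max(tgt, key=lambda e: cosine(src[w], tgt[e]))


# Toy corpus of (source tokens, target tokens); transliterated for readability.
corpus = [
    (["bharat", "sarkar", "niti"], ["india", "government", "policy"]),
    (["bharat", "sarkar"], ["india", "government"]),
    (["sarkar", "niti", "naya"], ["government", "policy", "new"]),
    (["bharat", "khel"], ["india", "sport"]),
]
best = translate(corpus, ["bharat", "sarkar", "niti"], "bharat")
```

On a corpus this small, two target words can tie on cosine score; the sketch then returns whichever comes first, so a real implementation would need an explicit tie-breaking rule.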
The FIRE 2010 and 2011 datasets are used to evaluate the CLIR system. Dataset statistics are given in Table 6. A CLIR system is evaluated by using Mean Average Precision (MAP). MAP for a set of queries is the mean of the average precision scores of the individual queries. Precision is the fraction of retrieved documents that are relevant to the query. The average precision of a query is calculated by Eq. (7).
Here $k$ is the rank in the sequence of retrieved documents, $n$ is the number of retrieved documents, $p(k)$ is the precision at rank $k$, and $rel(k)$ is equal to 1 if the document at rank $k$ is relevant and 0 otherwise.
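The computation of average precision and MAP from these quantities can be sketched as follows. The sketch assumes the common convention that the sum of $p(k) \cdot rel(k)$ is divided by the number of relevant documents; when that count is not supplied, it falls back to the number of relevant documents in the ranked list. The function names are illustrative.

```python
def average_precision(ranked_rel, num_relevant=None):
    """Average precision for one query.

    ranked_rel: list of rel(k) flags (1 or 0) in rank order.
    num_relevant: total relevant documents for the query (assumed divisor).
    """
    if num_relevant is None:
        num_relevant = sum(ranked_rel)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / k  # p(k) evaluated at a relevant rank
    return total / num_relevant


def mean_average_precision(runs):
    """Mean of the per-query average precision scores."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)


# Two toy queries: relevant documents at ranks 1 and 3, and at rank 2.
runs = [[1, 0, 1, 0], [0, 1]]
map_score = mean_average_precision(runs)
```

For the first toy query the average precision is (1/1 + 2/3) / 2, and for the second it is 1/2.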
Different experimental setups are prepared for SMT, NMT, MD, PD, and the proposed TIA by using the different resources listed in Table 5.
FIRE dataset statistics
Resources for the training of the SMT and NMT systems
SMT based approach: An SMT system is trained in three ways by using different resources, which are given as follows:
SMT_setup1: HindiEnCorp is used for both training and language modeling.
SMT_setup2: IITBCorpus is used for both training and language modeling.
SMT_setup3: IITBCorpus is used for training, while the WMT news corpus 2015 is used for language modeling.
NMT based approach: An attention based RNN encoder-decoder system is trained in four ways by using different resources [2], which are given as follows:
NMT_setup1: HindiEnCorp is used for training with the dropout value 0.0.
NMT_setup2: HindiEnCorp is used for training with the dropout value 0.2.
NMT_setup3: IITBCorpus is used for training with the dropout value 0.0.
NMT_setup4: IITBCorpus is used for training with the dropout value 0.2.
Byte Pair Encoding (BPE) with 15,500 merge operations is used to learn the vocabularies [22] with the subword-nmt tool. An open-source NMT tool is used to train the attention based RNN encoder-decoder with two hidden layers, an embedding dimension of 512 units at each layer, and 20,000 training steps.
MD based approach: The MD “Shabdanjali” is used to translate the exactly mapped words. LCSR is used to search for query words without an exact mapping, with an empirically defined LCSR threshold of 0.75. The MD based approach is also analyzed with the refined stop-words list.
PD based approach: A PD is learned from HindiEnCorp, and the translation with the maximum probability is selected among the multiple translations. LCSR is used to search for query words without an exact mapping, with an empirically defined LCSR threshold of 0.75. A refined stop-words list is used instead of the standard stop-words list to analyze the impact of refined stop-words. The best translation is chosen from the top-k translations, where k = 5 is an empirically defined constant.
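TIA: The proposed TIA uses the following thresholds: LCSR threshold = 0.75, t = 10 (minimum number of contextually selected parallel sentences), T = 5 (word frequency threshold for auto-correction), and z = 70 (number of unigram-based parallel sentences added).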
The FIRE 2010 and 2011 Hindi language queries are translated by using SMT, NMT, MD, PD, and TIA, and the translated queries are used to retrieve the target English language documents. Target language refined stop-words are eliminated from the target documents. Term Frequency-Inverse Document Frequency and cosine similarity are used for indexing and retrieval, respectively.
SMT and NMT systems are expected to generate fluent translation output; hence, they may add unnecessary words that increase the noise in the translation. MD and PD contain translation pairs in which a word may have multiple translations. The MD based approach selects the first translation pair, and the PD based approach selects the translation with the maximum probability. The proposed TIA selects the translation based on the contextual words. TIA also incorporates the refined stop-words list and the morphological variants solutions. The MAP scores for SMT, NMT, MD, PD, and TIA are given in Table 7.
Experiment results (MAP) to analyze the impact of TIA and RSW list
SMT_setup1, which is trained on HindiEnCorp, performs better than SMT_setup2 and SMT_setup3, which are trained on IITBCorpus. HindiEnCorp is a subset of IITBCorpus but is better organized, so SMT_setup1 achieves better performance than the other SMT setups. SMT_setup3 uses the WMT news corpus 2015 for language modeling, so it performs slightly better than SMT_setup2.
A neural network requires a large amount of training data. NMT_setup4, which is trained on IITBCorpus, achieves a better MAP than the other NMT setups for both the FIRE 2010 and 2011 datasets. Experimental results show that the SMT system performs better than the NMT system because SMT uses a word alignment table while NMT uses context vectors (attention mechanism).
MD, PD, SMT, and NMT based CLIR approaches with the standard stop-words list are considered the baselines. The proposed TIA incorporates the morphological variants solutions, so it achieves a better MAP than the baselines. The baseline approaches and the proposed algorithm are also tested with the refined stop-words list, and the RSW list improves the MAP compared to the standard stop-words list, as shown in Table 7. Because the proposed TIA incorporates the RSW list and MVS and computes translations based on the contextual words, it outperforms the MD, PD, SMT, and NMT based approaches.
A separate experiment is performed to analyze the impact of the refined stop-words lists, in which the FIRE 2010 and 2011 Hindi and English topic sets (queries) are used for monolingual information retrieval. Each query in a topic set has three tags, namely 〈Title〉 (T), 〈Desc〉 (D), and 〈Narr〉 (N). Each of these three fields is tested individually with the standard stop-words list and with the refined stop-words list, and evaluated by using MAP, as shown in Table 8. The refined stop-words list achieves a better MAP for the FIRE 2010 topic sets in both languages, while it achieves approximately equal performance for the FIRE 2011 topic sets, because the FIRE 2010 topic set has more stop-words than the 2011 topic set; the average query lengths for the FIRE 2010 and 2011 topic sets are 6 and 3, respectively.
Monolingual information retrieval results (MAP) to analyze the impact of the RSW list
CLIR incorporates a machine translation technique followed by an information retrieval technique. User queries are translated by using MD, PD, SMT, NMT, and the proposed TIA. HindiEnCorp is smaller and better organized than IITBCorpus, and SMT_setup1 uses HindiEnCorp; therefore, its performance is better than that of the other SMT and NMT setups. The standard stop-words list contains some significant stop-words whose presence may improve CLIR performance, so these significant stop-words are eliminated from the standard stop-words list. The resulting refined stop-words list improves the MAP compared to the standard stop-words list. SMT and NMT systems do not deal with morphological variants, while the proposed TIA incorporates the refined stop-words list and the morphological variants solutions and, in addition, translates query words based on the contextual words; therefore, TIA outperforms the other approaches.
