Abstract
In this work, we present a morphological segmenter for the Mexican indigenous language Wixarika. Segmentation is fundamental for morphologically rich languages, a common trait among Native American languages, because it improves downstream tasks such as machine translation, dialogue systems, and summarization. Beyond the agglutinative nature of the language, the scarcity of resources and the lack of an orthographic standard among dialects add to the challenge. Our proposal is based on a probabilistic finite-state approach that exploits regular agglutinative patterns and requires little linguistic knowledge. We show that our approach outperforms unsupervised and semi-supervised methods in a low-resource context. The dataset used in this work has been openly released for future work by the community.
Introduction
Indigenous languages face several challenges in the new technological context. Many of them are endangered due to a technological reality built around other, more culturally dominant languages. People of all ages now use computing technology (mobile phones, smartphones, laptops, computers) in their own language; however, among native speakers of indigenous languages it is common to find code switching (mixing the native language with the dominant one) or outright use of the dominant language. Natural Language Processing (NLP) can help preserve and revitalize these languages by providing ways to analyze them and use them in high-level NLP applications such as translation, dialogue systems, and summarization.
Wixarika is a language spoken in the Mexican states of Jalisco, Nayarit, Durango and Zacatecas (central-west Mexico) by approximately fifty thousand people. Like most South and North American indigenous languages, Wixarika has complex verbal morphology [2]. For instance, the word nep+ka’ukats+k+, which can be translated into English as “I don’t have a dog”, is segmented into the morphs ne|p+|ka|’u|ka|ts+k+. Note that in Wixarika the symbol + denotes one of the vowels of the language; for this reason, we use the symbol | to delimit morphemes. Although this word is a verb form, its agglutinative nature makes it a full sentence. In this example, ts+k+ is the stem and means “dog”, ne is a first-person possessive, ka marks negation, ’u refers to a visual object, and the second ka is the second part of the negation. Studying Wixarika from a technological point of view is difficult for two reasons: first, few NLP resources exist that allow computational processing of the language; second, its morphological richness makes it harder to adapt tools from more commonly studied languages such as English or Spanish. Since the first step in many NLP tasks depends on a correct segmentation of the sentences, in this work we present a probabilistic finite-state morphological analyzer for the Wixarika language. The goal of the segmenter is to identify the boundaries between the morphs of the language; a good segmentation has a strong effect on the performance of other tasks such as machine translation and summarization.
Different linguistic studies have recorded Wixarika in written form, but its spelling is still not standardized. The spelling most commonly used in practice by native speakers is an alphabet of 18 symbols: Σ = {a, e, h, i, +, k, m, n, p, r, t, s, u, w, x, y, ′}, as proposed in Gómez [7] and Iturrio and Gómez López [13]. In this work we follow this convention both for the description of the work and for the resources created. When the language imports words from other languages, such as Spanish, symbols outside this alphabet can be added to it. However, Spanish words are also often adapted to the existing alphabet; e.g., the name Jesus uses j and can be transformed to kets+. Table 1 shows a normalization of the symbols used in other alphabet conventions.
Wixarika alphabet normalization. The symbols a, e, h, i, k, m, o, p, r, t, u y appear in all versions
Results for the morphological segmentation task on Wixarika using direct comparison to the gold segmentation: edit distance (ED) and accuracy (Ac.).
Results for the morphological segmentation task on Wixarika using the EMMA metric. M. stands for Morfessor, P for precision, R for recall and F for the F-measure.
Given a word w in Wixarika, the output of the morphological segmenter is a list of substrings called morphs. Past research has focused on unsupervised methods, but these can only be applied to languages for which a sufficiently large corpus of words exists [19]. In highly agglutinative languages this task can be done at the surface level, and for those languages we do not need to infer any fused morpheme types. In this work, we propose a semi-supervised approach in which a seed of morphological labels interacts with a set of rules in a hybrid fashion to produce the corresponding segmentation.
For indigenous languages like Wixarika, the scarcity of digitally available resources bounds the performance of these methods. This scarcity is widespread among Mexican indigenous languages. For instance, among the Yutonahua languages, efforts to gather large collections of digital texts exist only for Nahuatl [10]. In the case of Wixarika, some prior work on Statistical Machine Translation has been done [17], but the set of examples is still limited.
On the other hand, rule-based automatic morphological analyzers require deep knowledge of the language or the expensive support of linguists [4]. Rule-based morphological analyzers have been developed for Quechua, Toba [18] and Aymara [12]. However, for a poorly studied language it is difficult to create such resources, because the research shifts from the computational aspects to the linguistic properties of the language. Although this type of research is also necessary for all indigenous languages, in this work we choose to focus on the computational aspect.
Our approach to the morphological segmentation of Wixarika addresses the scarcity of both linguistic knowledge and large digital corpora: we propose a hybrid system that combines morphological knowledge from descriptive grammars of the language with a probabilistic model learned from supervised data (previously seen segmented words).
There are also efforts on other NLP tasks for indigenous languages, such as Machine Translation (MT) from and to these languages, which has usually been done with rule-based approaches. Examples include the Quechua-Spanish translator developed by [1], which uses the rule-based Apertium translation system [5], and a Zapoteco-Spanish translation application [8].
Our contribution is the construction of the first morphological analyzer for Wixarika, using hand-specified lists of legal stems and affixes together with an n-gram model that describes morph sequences. This hybrid method can achieve good performance for a morphologically rich language with scarce resources and little available grammatical knowledge.
Wixarika belongs to the Yutonahua language family, which also includes Nahua, Nayeri, and Raramuri, among others. These languages have agglutinative morphology, using prefixation as well as suffixation around the verb stem. Nouns can act as stems in verbs; the affix “p+” serves as the verbalizing morpheme. The agglutination is almost strictly concatenative, and each morpheme must be realized at a specific position in the word. The same string in a different position conveys a different meaning: e.g., the prefix ne- in position 16 acts as a pronominal morpheme, but in position 3 it is a possessive morpheme [7]. Iturrio and Gómez López [13] identify 18 such prefix positions and 23 suffix positions, where each position allows a certain set of morphemes (or can be left empty).
This description of the language can be used to construct a finite-state transducer (FST) from a list of legal morphemes at each position. Although there are more complex rules that govern sequences of morphemes, we will assume that the only condition is that each position allows only morphemes from its list. The errors introduced by this assumption will be corrected later by the n-gram model.
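The position-slot idea described above can be illustrated with a minimal sketch. It is not the released implementation: the function name `segmentations` and the toy morpheme lists are assumptions made for illustration, and the lists are far smaller than those in [13]. Each prefix or suffix position may stay empty or emit one morph from its list, and exactly one stem is required.

```python
from typing import List, Sequence, Set

def segmentations(word: str,
                  prefix_slots: Sequence[Set[str]],
                  stems: Set[str],
                  suffix_slots: Sequence[Set[str]]) -> List[List[str]]:
    """Enumerate every reading of `word` as prefixes + stem + suffixes.

    The only constraint is positional: each slot accepts morphs from its
    own list (or stays empty), mirroring the simplifying assumption above.
    """
    # Flatten the automaton into ordered (allowed morphs, is_optional) slots.
    slots = [(s, True) for s in prefix_slots]
    slots.append((stems, False))            # the stem is mandatory
    slots.extend((s, True) for s in suffix_slots)

    results: List[List[str]] = []

    def walk(pos: int, slot: int, path: List[str]) -> None:
        if slot == len(slots):
            if pos == len(word):            # accept only if all input is consumed
                results.append(list(path))
            return
        allowed, optional = slots[slot]
        if optional:
            walk(pos, slot + 1, path)       # leave this position empty
        for morph in allowed:
            if morph and word.startswith(morph, pos):
                path.append(morph)
                walk(pos + len(morph), slot + 1, path)
                path.pop()

    walk(0, 0, [])
    return results

# Toy data built from the example word in the introduction (ASCII apostrophe).
# "kats+k+" is a made-up stem entry, added only to create an ambiguity.
PREFIX_SLOTS = [{"ne"}, {"p+"}, {"ka"}, {"'u"}, {"ka"}]
STEMS = {"ts+k+", "kats+k+"}
SUFFIX_SLOTS: List[Set[str]] = []

candidates = segmentations("nep+ka'ukats+k+", PREFIX_SLOTS, STEMS, SUFFIX_SLOTS)
```

With these toy lists the word receives two analyses, which is the kind of ambiguity a subsequent scoring step has to resolve.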
The stem is not defined by any rule and can be based on words from other languages (e.g., Spanish). For the present study, however, we limited the possible stems to a list of 374 strings learned from examples. The affix lists were taken from the linguistic work of Iturrio and Gómez López [13], which is a revision of an earlier study [9] and contains 131 morphemes.
A finite-state transducer accepts any string w in the language that it defines and returns a set of accepting paths. The complete automaton for Wixarika verbs is shown in Fig. 1; its different accepting paths for a word w correspond to different morphological analyses of w. The FST has 47 states (prefix states are marked with p and suffix states with s) and is composed of 374 different arcs.

Finite-state transducer (FST) for the Wixarika language. The “stem” arc stands for a collection of 374 arcs representing different stems.
In practice, there are few enough analyses that we can enumerate all of them. To choose the most probable analysis from among these, we used a simple n-gram model with Kneser-Ney smoothing [15], where each gram is a morph (a surface string associated with some morpheme). This model scores the sequence of non-empty strings (morphs) without considering their absolute positions. As a result, it can be trained simply from a segmented corpus.
As the automaton already found all possible segmentations, we only need to evaluate the probabilities for each n-gram and return the best-ranked segmentation. This was done by enumerating all of the segmentations and scoring them separately with the n-gram model.
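The scoring step can be sketched as follows. This is a simplified stand-in, not the model used in the experiments: it is a bigram model only, with interpolated Kneser-Ney smoothing, and the uniform floor for unseen morphs, the class name `BigramKN`, the discount of 0.75, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence

BOS, EOS = "<s>", "</s>"

class BigramKN:
    """Bigram model over morphs with interpolated Kneser-Ney smoothing."""

    def __init__(self, corpus: Sequence[Sequence[str]], discount: float = 0.75):
        self.d = discount
        self.bigrams: Counter = Counter()        # c(u, w)
        self.context_total: Counter = Counter()  # c(u, *)
        followers = defaultdict(set)             # distinct w seen after u
        preceders = defaultdict(set)             # distinct u seen before w
        for morphs in corpus:
            seq = [BOS, *morphs, EOS]
            for u, w in zip(seq, seq[1:]):
                self.bigrams[(u, w)] += 1
                self.context_total[u] += 1
                followers[u].add(w)
                preceders[w].add(u)
        self.n_followers = {u: len(ws) for u, ws in followers.items()}
        self.n_preceders = {w: len(us) for w, us in preceders.items()}
        self.n_bigram_types = len(self.bigrams)
        self.vocab_size = len(preceders)         # seen morphs plus EOS

    def _continuation(self, w: str) -> float:
        # Discounted continuation count, interpolated with a uniform
        # distribution over the vocabulary plus one unknown type
        # (assumption: this floor keeps unseen morphs at nonzero mass).
        seen = max(self.n_preceders.get(w, 0) - self.d, 0.0) / self.n_bigram_types
        backoff = self.d * self.vocab_size / self.n_bigram_types
        return seen + backoff / (self.vocab_size + 1)

    def prob(self, u: str, w: str) -> float:
        total = self.context_total.get(u, 0)
        if total == 0:                           # unseen context: back off fully
            return self._continuation(w)
        lam = self.d * self.n_followers[u] / total
        discounted = max(self.bigrams.get((u, w), 0) - self.d, 0.0) / total
        return discounted + lam * self._continuation(w)

    def logprob(self, morphs: Sequence[str]) -> float:
        seq = [BOS, *morphs, EOS]
        return sum(math.log(self.prob(u, w)) for u, w in zip(seq, seq[1:]))

def best_segmentation(candidates: Sequence[List[str]], model: BigramKN) -> List[str]:
    """Score every enumerated segmentation separately and keep the best one."""
    return max(candidates, key=model.logprob)

# Toy segmented corpus (illustrative only; not taken from the released data).
TOY_CORPUS = [
    ["ne", "p+", "ka", "'u", "ka", "ts+k+"],
    ["ne", "ts+k+"],
    ["p+", "ts+k+"],
]
```

Because the model only sees sequences of non-empty morphs, it needs nothing beyond a segmented word list for training, which matches the setup described above.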
It is also possible to place weights on the FSA arcs, but placing weights on each arc would only allow a 1-gram model. For an n-gram model, we need to split states, so that each state remembers the previous n - 1 morphs; that would result in a much bigger machine. Irregular agglutinations and unknown stems can mislead the automaton, so it sometimes fails to recognize an input word. If this happens, we can fall back to an unsupervised method to analyze this word. Usually an unsupervised analyzer under-performs with scarce resources, but it can improve the final segmentation in practice.
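The fallback logic can be sketched in a few lines. The names `hybrid_segment`, `toy_fst`, and `toy_fallback` are hypothetical, and the fallback below is a trivial placeholder rather than a call into an actual unsupervised segmenter.

```python
from typing import Callable, List, Optional

# A segmenter returns a list of morphs, or None/[] when it rejects the word.
Segmenter = Callable[[str], Optional[List[str]]]

def hybrid_segment(word: str, primary: Segmenter, fallback: Segmenter) -> List[str]:
    """Use the finite-state analysis when one exists, else the fallback."""
    analysis = primary(word)
    if analysis:                      # non-empty analysis: the automaton accepted
        return analysis
    rescued = fallback(word)
    return rescued if rescued else [word]   # last resort: leave the word whole

# Toy stand-ins (assumptions): a lookup table plays the role of the automaton
# plus n-gram ranking, and the fallback returns the word unsegmented.
_KNOWN = {"nep+ka'ukats+k+": ["ne", "p+", "ka", "'u", "ka", "ts+k+"]}

def toy_fst(word: str) -> Optional[List[str]]:
    return _KNOWN.get(word)

def toy_fallback(word: str) -> Optional[List[str]]:
    return [word]
```

Keeping the two segmenters behind the same callable type means the unsupervised component can be swapped without touching the finite-state part.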
For our experiments we collected two Wixarika corpora, as shown in Table 4. The first is a high-quality segmented text taken from a grammar [7] containing 1,079 word types, which we used as our gold standard. We randomly selected 400 of these words as a test set; the remaining 679 were used to train 51 semi-supervised Morfessor models and our n-gram model, which is estimated over morph tokens [23].
The second source of words is a translation of Hans Christian Andersen’s classic fairy tales into Wixarika containing 17,131 unsegmented word types, used to train the unsupervised Morfessor model. It is important to note that, although both resources are in Wixarika, they represent different dialects.
Evaluating morphological segmentations is difficult, since a single word can have several valid segmentations. There are two types of metrics for morphology: those that directly compare the hypotheses against the gold standard, and those that perform the comparison indirectly “by measuring the strength of an isomorphic like relationship between the proposed and answer morphemes” [20].
In this work, we used both types of metrics. For direct comparison we follow Kann et al. [14], using the accuracy and the edit distance of morphs between the hypothesis and the gold standard. For the indirect evaluation we used EMMA [20], which produces precision, recall and F-measure scores.
Table 3 presents a summary of the direct results using Morfessor under three scenarios: with segmented data (WSD), without segmented data (WND), and with the large dataset (corpus) (WLD). In the first case, the segmented data (the 679 training words) are passed to Morfessor twice: first in a semi-supervised setting and then in a fully supervised one. In the second case, Morfessor only sees the collection of 679 words, without segmentation. Finally, in the third case we add the 17,131 words to the second case. As can be seen, there is no difference between the first and second cases; we attribute this to dialect differences between the resources. For training the Morfessor model we varied the corpus weight from 0.1 to 7 in steps of 0.5 for the recursive algorithm. In addition, we trained a model with the Viterbi algorithm. In this last case, Viterbi is more prone to errors when the segmented information is unavailable. The corpus weight also matters: a small value produces a weak segmenter, although there is a sweet spot for its value.
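The two direct metrics can be sketched as below. This is one possible reading, stated as an assumption: accuracy is taken as the share of words whose predicted morph sequence matches the gold sequence exactly, and edit distance as the Levenshtein distance computed over morph sequences (rather than characters), averaged over words. The function names are illustrative.

```python
from typing import List, Sequence, Tuple

def edit_distance(gold: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance where each symbol is a whole morph."""
    prev = list(range(len(hyp) + 1))
    for i, g in enumerate(gold, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (g != h)))  # substitution or match
        prev = cur
    return prev[-1]

def direct_scores(pairs: Sequence[Tuple[List[str], List[str]]]) -> Tuple[float, float]:
    """Return (accuracy, mean edit distance) over (gold, hypothesis) pairs."""
    exact = sum(1 for gold, hyp in pairs if list(gold) == list(hyp))
    total_ed = sum(edit_distance(gold, hyp) for gold, hyp in pairs)
    return exact / len(pairs), total_ed / len(pairs)

# Toy example: one exact match and one under-segmented hypothesis.
GOLD = ["ne", "p+", "ka", "'u", "ka", "ts+k+"]
HYP = ["ne", "p+", "ka", "'u", "kats+k+"]
accuracy, mean_ed = direct_scores([(GOLD, GOLD), (GOLD, HYP)])
```

Accuracy treats a near miss as a full error, while edit distance gives partial credit for segmentations that differ from the gold standard in only a few morphs.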
Results for the morphological segmentation task using different semi-supervised Morfessor setting on Wixarika. P stands for precision, R for recall and F for the F-measure
Description of the two data sets used to train the segmentation models. The small data set is a high quality phrase collection of segmented words. The large data set is a parallel corpus of a translation of Hans Christian Andersen’s classic fairy tales, but has no segmentation available
Our proposed
Table 2 shows that the Morfessor results suffer from a lack of examples from which to infer a good model. However, this can be improved by using the segmented corpus as the base of the model. In all trained models we included a semi-supervised setting using a list of words with their segmented versions from the development set. We only changed the amount and form of the base data of the model: models trained with an unsegmented corpus obtained poorer results than those trained with only segmented data. Using a larger dataset helped, but did not yield the best version. On the other hand, our first model, WixNLP without maximizing the probabilities of paths, improves on Morfessor without even using probabilities for disambiguation. The criterion used for choosing a path was minimum path length. WixNLP with 2-grams underperforms plain WixNLP in the indirect evaluation but obtains better results in accuracy and edit distance. The version using 3-grams improves the results notably in all metrics. The hybrid approach deals with the problem of unseen roots and suffixes, and thus achieves the best results in all metrics, particularly with a 3-gram model.
Morphological segmentation has a long history; Harris et al. did the first research on the task in 1951 [11]. Since then, many approaches have been developed. Two essential systems have been used with good results for semi-supervised segmentation: Linguistica [6] and the extended version of Morfessor [16]. Morfessor also has an earlier unsupervised method [3]. In addition, rule-based, language-specific finite-state analyzers have been developed for many languages, achieving good results. For indigenous languages, such morphological analyzers have been developed for Quechua, Toba [18] and Aymara [12]. These works differ from our approach in that they try to model the complete underlying morphology of their languages, whereas WixNLP only uses a list of morphemes and infers the morphological rules. For the specific case of Wixarika, there was no previous attempt to implement a finite-state approach.
Conclusion
Morphological segmentation is an important task for processing the Wixarika language. In this work we presented the first Wixarika morphological analyzer: a finite-state transducer that exploits the agglutinative pattern of Yutonahua languages, with lists of stems and affixes, together with an n-gram model to select the best segmentation among multiple matches. We showed that for Wixarika our method improves on the Morfessor baselines.
We also created and publicly released a parallel Wixarika-Spanish dataset to encourage the community to study this language further. Together with these corpora we release a couple of NLP tools, such as a normalizer and a tokenizer to handle Wixarika texts. These tools are important due to the lack of orthographic standardization among native speakers.
For future work, we plan to apply this methodology to other Yutonahua languages. We also want to feed the morphological segmentation to an MT system. It is also desirable to find improved methodologies for combining unsupervised and supervised methods to address the scarce-resource problem for agglutinative languages, including tagging each morph as in [21].
