Abstract
The aim of this article is to contextualize and describe the collection and annotation of a corpus of conventual Hispanic and Novohispanic texts for emotion identification. This corpus will be the dataset for an emotion identification model based on machine learning / deep learning techniques. The document also describes several exploratory experiments carried out on the corpus: how the corpus is used to obtain a lexicon mapped to polarities and emotions, and how some of the documents are hand-labeled by experts for the evaluation of the machine learning / deep learning-based emotion classification model. Finally, future uses of and experiments with the corpus are described.
Introduction
The importance of historical texts lies in the fact that references of authors, titles, and places within manuscripts represent an intellectual portrait of a specific time in history, providing us, with academic and religious rigor, the ability to use and manage information through reading and writing. Texts represent a formal means of communication through which the authors related to the scientific communities of their historical period.
All this handwritten work is a bibliographic heritage, a valuable testimony of a historical period preserved to this day. Those works and testimonies are part of the greatness of culture; they reflect the use of methods, discussion systems, and arguments [1].
The particular case of conventual texts, written mainly by religious women, helps us understand several aspects of their daily life, which was permeated by manifestations of the supernatural and divine. The handwritten testimonies of women in the cloister show the importance of the role of nuns and pious women in the development of various literary genres. At the same time, texts of this type are very rich and deeply reflect the emotions of their authors [2].
The corpus (a body of written or spoken material on which a linguistic analysis is based) to be produced will be a compilation of digital files containing historical texts, which will be automatically analyzed for emotion identification. For this task, texts provided mainly by researcher Dr. Rosalva Loreto López, Director of University Historical Heritage at BUAP, have been considered. The selection criteria for the texts are their length (at least 30 pages), century of creation (16th to 18th centuries), place of creation (New Spain or Spain), and type (mainly biographies, autobiographies, and personal letters). These texts have already been translated into contemporary Spanish and reviewed by expert historians and/or linguists in order to guarantee their validity as historical texts, their correct translation, and the integrity of their content. This corpus will be the dataset for an emotion identification model based on machine learning / deep learning techniques.
This document is organized as follows: Section 2 presents the state of the art through the discussion of some works related to the subject of this work; section 3 talks about the corpus as a source of data for the emotion identification project; section 4 describes the corpus and its construction process; section 5 presents some preliminary experiments with the data contained in the corpus; section 6 mentions the future work that can be done; finally, in Section 7, the current results and conclusions are summarized, followed by the references.
Related works
Historical corpora and text collections have been built for various languages (Arabic, Chinese, Dutch, English, French, Galician, German, Ancient Greek, Icelandic, Japanese, Latin, Old Norse, Polish, and Portuguese) in order to allow linguists to examine phenomena and develop linguistic theories, and to support the development of Natural Language Processing (NLP) tools through the provision of training and test data [3].
The transcription of historical documents is extremely important for the creation of digital libraries, and a quality transcription is necessary to make these documents navigable. In this regard, researchers have conducted projects for the transcription of medieval documents, for instance testing methods such as hidden Markov models and bidirectional Long Short-Term Memory (BLSTM) neural networks on the Parzival database, a collection of 13th-century manuscripts written in Middle High German by various writers [4]. A similar work with the Rodrigo data set, which contains the digitization of an old Spanish manuscript used for handwritten text recognition, focuses on the problem of out-of-vocabulary (OOV) words in ancient historical texts. This work combines a powerful optical recognition system, capable of dealing with noise and image variability, with a language model based on sub-lexical units that model OOV words [5].
Regarding the mining of historical texts, we can find works on sources such as the Ryan Report, the report of the Irish government’s investigation into child abuse in Irish industrial schools between 1920 and 1990 [6]. Trading Consequences, part of the Digging into Data project, uses text mining to help environmental historians understand the economic and environmental consequences of the commodity trade during the 19th century [7].
Within the analysis of historical texts in Spanish, we can find CODEA, Corpus of Spanish Documents Preceding 1700, an online corpus that contains 1,500 documents dating from the 11th to the 17th centuries. Using tools such as search engines and the visualization of results in graphs and maps, the authors turn the corpus into a powerful research tool for linguistic variation, both diachronic and geographical [8]. DECM (Digging into Early Colonial Mexico) is a project that explores computational techniques using a Big Data approach. It focuses on analyzing a 16th-century corpus known as “The Geographical Relations of New Spain”, which consists of documents related to New Spain, particularly the areas encompassing present-day Mexico and Guatemala. Textual information is identified, extracted, and analyzed through a combination of natural language processing techniques, machine learning, and text mining [9].
In the area of sentiment analysis for historical texts, there are tools such as ALCIDE, an online platform for the analysis of historical content [10], for which a new lexical resource was developed through a semiautomatic mapping from two English lexicons. Its long-term goal is to create a system that supports historical studies by analyzing sentiment in historical texts and discovering opinion on a topic, as well as its change over time. There are other lexicon-based works that quantify emotions in non-contemporary texts. This approach has been applied to a 210-year section of the German DTA (Deutsches Textarchiv) corpus, finding clear emotional cues that evolve over time among the main literary forms [11].
Among the few works that perform sequential analysis of emotions in historical texts is that of Sreeja P. S., who uses directed graphs to identify the flow of emotions in poems written by Indian poets in English [12].
There are also some works dedicated to the mining or sentiment analysis of religious texts only. These are mainly works that compare representative texts of different religions, with the objective of exploring similarities and differences between religious texts through automatic text processing with current mining methods [13–17].
After reviewing the listed works, we noted that few are devoted to the analysis of historical texts in Spanish; most analyses have been carried out on European corpora. Furthermore, the texts in Spanish have been automatically analyzed for purposes other than sentiment analysis, such as the transcription of handwritten documents or the extraction of geographic data. That is why developing a research project for the automatic analysis of historical texts to identify emotions is an opportunity to contribute to research in the combined fields of computer science and the history of feelings, a project that falls within the domain of the digital humanities.
Most of the related works built their own datasets because no previous corpora had the necessary characteristics, such as the historical nature of the texts.
Historical texts for emotion identification
Long texts often have characteristics (negated expressions, adverbs of intensity or degree, intensifiers, auxiliary verbs in some languages, etc.) that complicate the use of current sentiment classification algorithms on them. Starting from these difficulties, which arise in texts longer than those that are currently the main subject of study in sentiment analysis (reviews, microblogging social networks, opinions, etc.), we seek a methodology that improves the automatic classification of extensive texts, with applications such as the detection of happiness and well-being, the improvement of automatic dialogue systems, etc. [18].
In this case, the extensive texts are historical texts. For their analysis, we want to develop a methodology, based largely on text mining and machine learning / deep learning, capable of analyzing longer historical texts in Spanish in the future.
At the same time, sentiment analysis from a historical perspective is important because sentiments shape individual, community, and national identities. Some research centers for the History of Emotions use global historical knowledge, initially from the period 1100–1800, to understand the long history of emotional behaviors [19]. In this way, by analyzing texts created within this period that are heavily loaded with sentiment, one can gain access to a very useful lexical resource [20].
In the methodology proposed for identifying emotions, one of the first and most important steps is, therefore, the creation of the corpus of documents written mainly by cloistered religious women. Since there is currently no public corpus with these characteristics and, at the same time, unpublished texts with the necessary characteristics have been made available by Dr. Loreto, the first step is to build a corpus large enough to serve as input data for the emotion identification model.
Corpus creation
In order to build a model that identifies emotions in long texts, the methodology to be followed can be divided into three main parts: (1) obtaining and pre-processing the corpus; (2) construction of an emotion identification model; and (3) analysis of results. A diagram of this methodology and the place of the corpus within it is shown in Fig. 1.

Fig. 1. Diagram of the proposed methodology.
As mentioned in Section 1, selection criteria are applied to the documents to be included in the corpus, the most important being the century of creation and the type of text (spiritual manuscripts written mainly by religious women).
Initially we had access to some unpublished texts written by seven different women who lived in New Spain in the 17th and 18th centuries: seven autobiographies and five spiritual diaries, adding up to 12 documents. These texts have been paleographed and translated into contemporary Spanish by experts. The preliminary experiments, which are explained in greater detail in the following section, were conducted with these initial 12 documents. However, these texts were not enough to train a machine learning / deep learning-based model for emotion identification; therefore, it was necessary to gather more texts with similar characteristics. Figure 2 shows the processes implemented for Phase 1 of the methodology, which consists of obtaining the texts for the creation of the corpus and pre-processing them.

Fig. 2. Phase 1 of the methodology.
In the next stage, four other collections of documents were obtained and added to the corpus: Práctica de la Theologia Mystica, Diálogos Espirituales, Monjas y Beatas, as well as some works from Sister Marcela de San Felix; all of them previously published documents.
Práctica de la Theologia Mystica was written by the Irishman Michael Wadding, a Jesuit missionary who lived for twenty years in New Spain under the name of Miguel Godinez. In this work, from a Jesuit perspective, he updates mystical knowledge, integrating the intellectual method with spiritual practice. The merit of this text is that its author was a confessor of nuns and the spiritual director behind two of the most important Novohispanic autobiographies.
Diálogos Espirituales is heterogeneous in nature, since it gathers writings mostly by women in Hispanic America; it includes letters, poems, and plays. The theme of the writings is always religious, linked to a model of feminine holiness. Twelve documents are taken from this work, all of them selected and contextualized by experts, but only the original transcriptions of the spiritual manuscripts were added to the corpus [21].
Monjas y Beatas presents five texts written or dictated by religious, pious, and lay women who lived a particular form of spirituality in the New Spain of the seventeenth and eighteenth centuries. These manuscripts can be considered a product of Catholic practices linked to the ascetic and mystical practices of the time. Three of them are autobiographies written by order of the confessor which, in addition to narrating the authors’ inner life, also offer very important information about daily life. One more was dictated to a scribe; another corresponds to epistolary documentation [22].
Finally, also within the genre of female conventual literature, the work of Sister Marcela de San Felix stands out for its rare quality as a historical and social document and introduces us to the female literature of the religious communities of Spanish women of the 17th century. The complete work includes six spiritual colloquia, eight loas, twenty-two romances, other diverse poems, and a short biography of a sister nun. From these writings, 15 were selected for inclusion in the corpus, extending it along with the other three works mentioned above. Table 1 shows a description of the documents contained in the corpus and some of their characteristics.
Table 1. Corpus description
Once the expansion of the corpus is completed, the pre-processing stage shown in the methodological proposal follows. For the pre-processing of the corpus, a standard pipeline was defined, consisting of the following steps: standardization; tokenization; POS labeling; stop-word removal; term frequency (TF); and TF-IDF.
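As an illustration of the pipeline above, the following minimal sketch chains standardization, tokenization, stop-word removal, and term-frequency counting using only the Python standard library. The stop-word list and the sample sentence are hypothetical, and the function names are our own. POS labeling is left out because it requires an external tagger, and TF-IDF is left out because it needs the whole collection rather than a single text.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, very short stop-word list used only for illustration.
STOP_WORDS = {"de", "la", "el", "en", "y", "que", "a", "los", "las", "con"}


def standardize(text):
    """Normalize Unicode, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Return runs of letters (accents included); digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", text)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def preprocess(text):
    """Run the steps in order and return the tokens with their term frequencies."""
    tokens = remove_stop_words(tokenize(standardize(text)))
    return tokens, Counter(tokens)


tokens, tf = preprocess("El alma   padece, y en el padecer el ALMA se consuela.")
print(tokens)
print(tf.most_common(3))
```

Keeping each step as a separate function makes it possible to replace one of them (for example, the tokenizer) without touching the rest of the pipeline.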
After pre-processing the texts, the AWS Comprehend tool and its included models were used for sentiment analysis in order to observe how the texts behave when processed by an existing sentiment analysis tool [23]. One of the operations offered by the tool is the sentiment analysis of a text, which determines whether the sentiment is positive, negative, neutral, or mixed. This operation returns the most likely sentiment for the text, as well as the scores for each of the other sentiments.
It was observed that the automatically determined polarity of the documents has very low precision compared with the feelings that any reader could identify in them at a glance. Another drawback found with this tool is the large number of unrecognized words, so a more exhaustive pre-processing of the transcripts must be carried out in order to obtain a better analysis.
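The sketch below shows one way such a call could be organized for long documents. It assumes a Comprehend client created with boto3 and its `detect_sentiment` operation, whose response carries a `SentimentScore` dictionary with the keys `Positive`, `Negative`, `Neutral`, and `Mixed`. Because the service limits the size of each request, the text is first split into chunks; the chunk size of 4,500 bytes and the averaging of chunk scores are our own assumptions, not the procedure followed in the experiments.

```python
def split_by_bytes(text, max_bytes=4500):
    """Split text on whitespace into chunks whose UTF-8 size stays under max_bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + word_size > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


def document_sentiment(client, text, language_code="es"):
    """Average the per-chunk scores and return (most likely label, averaged scores)."""
    chunks = split_by_bytes(text)
    if not chunks:
        raise ValueError("text is empty")
    totals = {"Positive": 0.0, "Negative": 0.0, "Neutral": 0.0, "Mixed": 0.0}
    for chunk in chunks:
        response = client.detect_sentiment(Text=chunk, LanguageCode=language_code)
        for label, score in response["SentimentScore"].items():
            totals[label] += score
    averages = {label: total / len(chunks) for label, total in totals.items()}
    return max(averages, key=averages.get), averages
```

With valid AWS credentials, the client would be created with `boto3.client("comprehend")` and passed as the first argument. Passing the client in, rather than creating it inside the function, also allows the logic to be tried with a stand-in object.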
As part of the methodology, the corpus is also used to build a domain-specific lexicon (DEL), where each term will be associated with a polarity and an emotion. One of the first steps in the construction of a domain lexicon is to define which terms will be part of it. Classic NLP techniques such as TF-IDF, used to extract the most important terms within the collection of documents, are a way to obtain the initial list of words (seed words) for the lexicon of the corpus [24]. Therefore, the following experiments with the dataset are aimed at creating a lexicon of the domain.
An exploratory count of frequent terms revealed a considerable number of unknown words, abbreviations that are no longer in use, inconsistent spelling, and a set of common words that recurs in each of the texts analyzed. Figure 3 shows a graph of the type of frequent terms found during the exploratory data analysis.
To remove stop words, the existing Spanish list from Python’s NLTK library was used. The list was supplemented with other common terms that should also be eliminated from the corpus: stop words with a different spelling, words inherent to annotations made during the transcription (such as page numbers), and words that refer to images or illustrations within the original manuscript (such as the word “crismón”). In addition, with the help of experts in the texts of the corpus, a list of abbreviations, acronyms, and words with inconsistent spelling was created; these were replaced in the texts with their meaning. Some of the acronyms, abbreviations, and non-existent words that were found are shown in Table 2.

Fig. 3. Most frequent terms in the corpus.
Table 2. Acronyms, abbreviations and non-existent words
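The cleaning step described above can be sketched as follows. To keep the example runnable without downloading NLTK data, a short hand-written subset stands in for the NLTK Spanish stop-word list. The extra stop words (other than “crismón”) and the abbreviation table are hypothetical examples, not the entries of Table 2.

```python
import re

# Stand-in for the NLTK Spanish stop-word list (hand-written subset).
BASE_STOP_WORDS = {"de", "la", "que", "el", "en", "y", "a"}

# Hypothetical additions: a spelling variant and transcription annotations.
EXTRA_STOP_WORDS = {"q", "crismón", "folio"}

# Hypothetical abbreviation table; the real one is built with the experts.
ABBREVIATIONS = {"vm": "vuestra merced", "dho": "dicho"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace each abbreviation that appears as a whole word with its expansion."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, table)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: table[m.group(1).lower()], text)


def clean(text):
    """Expand abbreviations first, then tokenize and drop stop words."""
    stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS
    expanded = expand_abbreviations(text.lower())
    tokens = re.findall(r"[^\W\d_]+", expanded)
    return [t for t in tokens if t not in stop_words]


print(clean("Crismón. Digo a vm que el dho confesor me mandó escribir"))
```

Abbreviations are expanded before stop words are removed so that an expansion containing a stop word is filtered in the same pass.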
In the experiments performed to calculate the TF-IDF weights of the corpus documents, n-grams were used to find collocations that are important within the analyzed documents. Some of the n-grams with the highest weight (the measure of the importance of a term within a corpus) in the documents are shown in Table 3. For each of the documents in the corpus, unigrams and bigrams were first analyzed together in the same experiment; then only bigrams were analyzed and, finally, only trigrams. Larger n-grams were not considered, since no important collocations longer than three words were found.
Table 3. N-grams with the highest weight in documents
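A minimal sketch of this kind of experiment is given below. It assumes one common variant of the weighting, with the relative frequency of the n-gram in the document as TF and ln(N / df) as IDF; the exact variant used in the experiments is not specified here, and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams, as space-joined strings, for n between n_min and n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tf_idf(documents, n_min=1, n_max=2):
    """Return one {n-gram: weight} dictionary per document."""
    grams_per_doc = [ngrams(doc, n_min, n_max) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(documents)
    weights = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = sum(counts.values())
        weights.append({
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return weights


# Hypothetical, already pre-processed documents.
docs = [
    "amor divino amor divino alma".split(),
    "alma triste temor grande".split(),
    "alma gozo amor".split(),
]

# Bigrams only; the other two settings are (1, 2) and (3, 3).
for doc_weights in tf_idf(docs, 2, 2):
    print(max(doc_weights, key=doc_weights.get))
```

Under this variant, a term that occurs in every document (such as “alma” in the toy data) receives a weight of zero, which matches the observation that common words recur across all the texts.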
From the first 1,782 words with the greatest weight across all the documents, those that may carry an emotional charge were extracted, eliminating auxiliary verbs and other words without an obvious emotional charge, as well as words that serve for amplification or negation. This led to a list of 1,000 terms that were first classified by their polarity using SentiWords, a lexical resource that contains approximately 155,000 English words associated with a sentiment score between -1 and 1 [25]. The words in SentiWords are in the lemma#POS format and are aligned with WordNet lists (which include adjectives, nouns, verbs, and adverbs). Thus, the polarity labeling of the seed words obtained through TF-IDF was carried out in a semi-automatic way. The labeling is considered semi-automatic because each term in Spanish was translated into English and searched for in the SentiWords list of terms; if there was more than one entry for a word, each with a different POS tag, the proper one was chosen manually.
Subsequently, the same terms were re-labeled with the advice of four different experts in the analyzed texts. This was done because the polarity of some words may change due to differences in historical context. These polarity changes can be seen in Table 4, which also shows an example of some of the seed terms and the labels assigned to them using the corpus-based approach. The emotions linked to the words of the lexicon are those identified in Plutchik’s wheel model. In this case only the eight basic emotions of the wheel are used; no intensities or combinations are considered for the labeling, as shown in Fig. 4.
Table 4. Seed terms for lexicon creation and associated emotion
- Note that the terms with an (*) next to them are the ones that, depending on the context of the domain, change their polarity.

Fig. 4. Plutchik’s wheel of emotions.
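The semi-automatic polarity labeling described above can be sketched as a lookup that resolves unambiguous terms automatically and flags the rest for a person to decide. The lexicon excerpt, its scores, the translations, and the expert override below are all hypothetical; only the lemma#POS key layout follows the description of SentiWords.

```python
# Hypothetical excerpt in the lemma#POS -> score layout; the scores are made up.
SENTIWORDS = {
    "love#n": 0.6, "love#v": 0.7,
    "fear#n": -0.5, "fear#v": -0.4,
    "penance#n": -0.3,
}

# Hypothetical translations of Spanish seed terms into English lemmas.
TRANSLATIONS = {"amor": "love", "temor": "fear",
                "penitencia": "penance", "clausura": "enclosure"}

# Hypothetical expert decision that changes a general-purpose polarity.
EXPERT_OVERRIDES = {"penitencia": 0.4}


def candidate_entries(english_lemma, lexicon=SENTIWORDS):
    """Return every lexicon entry for the lemma, whatever its POS tag."""
    prefix = english_lemma + "#"
    return {k: v for k, v in lexicon.items() if k.startswith(prefix)}


def label_polarity(term, chosen_pos=None):
    """Return (status, score); ambiguous terms need a POS chosen by a person."""
    lemma = TRANSLATIONS.get(term)
    entries = candidate_entries(lemma) if lemma else {}
    if not entries:
        return "not_found", None
    if len(entries) == 1:
        return "automatic", next(iter(entries.values()))
    key = f"{lemma}#{chosen_pos}"
    if chosen_pos is None or key not in entries:
        return "needs_review", None
    return "manual", entries[key]


def final_polarity(term, chosen_pos=None):
    """Expert re-labeling takes precedence over the lexicon score."""
    if term in EXPERT_OVERRIDES:
        return EXPERT_OVERRIDES[term]
    return label_polarity(term, chosen_pos)[1]


for term in ("amor", "penitencia", "clausura"):
    print(term, label_polarity(term), final_polarity(term, chosen_pos="n"))
```

Returning an explicit status makes it easy to count how many seed terms were resolved automatically and how many required a manual choice.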
It is worth mentioning that POS labeling experiments on the texts were also conducted with FreeLing [26]. The resulting labeling contained errors when handling proper names, mainly due to their length, as well as errors with words no longer in use or with a different spelling. Therefore, in order to use POS labeling in later phases, it will be necessary to lemmatize the texts and correct their spelling.
After conducting the experiments explained above, and based on their results, it was observed that, due to the nature of the texts, the optimal way to evaluate the emotion identification model once it is finalized is against data manually labeled by experts in this type of writing, because there is no existing validation set (gold standard).
To carry out this validation, the texts labeled with emotions by experts must be compared against the texts labeled automatically by the model once it is constructed. For this purpose, we requested the expert labeling of some of the previously published texts. In this way, the unpublished character of the texts that were initially considered for study is also preserved. For manual labeling, the CATMA platform has been used, as it offers features such as web access, a configurable taxonomy of labels, output data in XML format, and the ability to do collaborative labeling in which each user has their own instance of the document to label [27].
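The comparison between expert labels and model labels can be sketched with standard classification metrics. The example below assumes that both labelings are aligned lists with one label per text segment, drawn from the eight basic emotions of Plutchik’s wheel; the label lists are hypothetical, and accuracy with per-class precision, recall, and F1 is only one possible choice of measures.

```python
from collections import Counter

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]


def evaluate(gold, predicted, labels=EMOTIONS):
    """Return accuracy and a {label: (precision, recall, f1)} report."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    pairs = Counter(zip(gold, predicted))
    accuracy = sum(pairs[(label, label)] for label in labels) / len(gold)
    report = {}
    for label in labels:
        true_pos = pairs[(label, label)]
        predicted_total = sum(c for (_, p), c in pairs.items() if p == label)
        gold_total = sum(c for (g, _), c in pairs.items() if g == label)
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return accuracy, report


# Hypothetical expert labels and model output for six segments.
gold = ["joy", "fear", "fear", "sadness", "joy", "trust"]
predicted = ["joy", "fear", "sadness", "sadness", "trust", "trust"]

accuracy, report = evaluate(gold, predicted)
print(round(accuracy, 3))
print(report["fear"])
```

Per-class figures are reported alongside accuracy because some emotions are likely to be much rarer than others in the labeled texts.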
As a conclusion of these initial experiments with the data set, we can say that the most viable methodology is to continue with the construction and expansion of a domain-specific lexicon with the advice of experts in the analyzed texts. This is because many religious terms have domain-dependent polarities that differ from their general-purpose polarities, which would also significantly affect the emotion tied to each term. The size of the corpus is already sufficient for its future use in an automatic classification model in which the lexicon (DEL) is intended to provide the keywords for classification.
Manual labeling will be used to evaluate the results of the model, in which lexicon-based approaches and machine learning / deep learning will be combined.
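To illustrate how the lexicon could serve as keywords for classification, the sketch below labels a sentence with the emotion that receives the most keyword hits and abstains when there are no hits or when two emotions tie. The lexicon entries are hypothetical, and the majority-vote rule is our own assumption about how the keywords might be used.

```python
import re
from collections import Counter

# Hypothetical lexicon entries: term -> (polarity, emotion).
LEXICON = {
    "gozo": ("positive", "joy"),
    "consuelo": ("positive", "joy"),
    "temor": ("negative", "fear"),
    "tristeza": ("negative", "sadness"),
}


def label_sentence(sentence, lexicon=LEXICON):
    """Return the emotion with the most keyword hits, or None if unclear."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    votes = Counter(lexicon[t][1] for t in tokens if t in lexicon)
    if not votes:
        return None  # no lexicon term in the sentence
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between emotions: leave the sentence unlabeled
    return ranked[0][0]


print(label_sentence("Sentí gran gozo y consuelo, aunque con algún temor."))
print(label_sentence("Fui al coro."))
```

Abstaining on unclear sentences keeps the automatically labeled set smaller but cleaner, which matters if those sentences are later used as training data.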
Future work
The next stages in the development of a methodology that can identify emotions in historical texts consist mainly of combining lexicon-based approaches and machine learning / deep learning. A fundamental part will be defining the machine learning (ML) / deep learning (DL) model or tool that will be used for the classification (e.g., Support Vector Machine, Naive Bayes, BERT, etc.). Once an ML/DL model is chosen, one possible approach is to use the lexicon as keywords to classify sentences and measure their similarity, then use the classified sentences to train the ML/DL model, and finally evaluate the predictions against the manual labeling done by experts. Another approach could be to create class vectors that represent the eight classes to be identified and then, using the corpus, calculate the cosine similarity between each class vector and the sentence vector generated from each sentence of the corpus, so that a sentence is assigned to the class whose vector has the greatest similarity, or a similarity greater than a certain threshold. In this case, manual labeling is also used to evaluate the performance of the method.
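The class-vector approach can be sketched with simple bag-of-words vectors. In the example below, each class vector is built from hypothetical seed terms (only three of the eight classes are shown), sentences are represented by their token counts, and the threshold of 0.2 is arbitrary. A real implementation would more likely use dense sentence embeddings, which is an assumption on our part.

```python
import math
import re
from collections import Counter

# Hypothetical seed terms per class; only three of the eight classes are shown.
CLASS_TERMS = {
    "joy": ["gozo", "consuelo", "alegría"],
    "fear": ["temor", "miedo", "espanto"],
    "sadness": ["tristeza", "llanto", "pena"],
}
CLASS_VECTORS = {label: Counter(terms) for label, terms in CLASS_TERMS.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(sentence, threshold=0.2):
    """Return (label or None, scores); None means no class reached the threshold."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    sentence_vector = Counter(tokens)
    scores = {
        label: cosine_similarity(sentence_vector, vector)
        for label, vector in CLASS_VECTORS.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores


print(classify("Tuve gran temor y espanto"))
print(classify("Fui al coro"))
```

The threshold lets the method leave a sentence unassigned instead of forcing it into the nearest class, and its value would need to be tuned against the expert-labeled texts.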
