Abstract
This work presents a method for gathering data to build a corpus on speech disorders in children. The corpus serves as the basis for generating ontologies semi-automatically, which together form a computational model to support therapists in diagnosis and possible treatment. Speech disorders, phonemes and additional information are classified using taxonomies obtained from the specialized literature on speech disorders. From these taxonomies, ontologies are developed that structure and formalize the concepts defined by the main authors on the topic. The ontologies are constructed following parts of classic methodologies and are then validated through competency questions. The development of the model is based on Natural Language Processing (NLP) and Information Retrieval (IR) techniques. The ontologies are integrated to enable a classification based on problematic phonemes, which is suggested as a complement to the diagnostic tool in the model.
Introduction
A speech or language disorder refers to problems in communication or in related areas, such as oral motor functions. These delays and disorders range from simple sound substitutions to the inability to understand or use language, or to use the oral-motor mechanism for speech and feeding. Some causes of speech or language impairments include hearing loss, neurological disorders, brain injury, intellectual disability, drug abuse, impairments such as cleft lip, and vocal abuse or misuse [1]; however, the cause is often unknown [2].
According to INEGI (National Institute of Statistics, Geography and Informatics), an estimated 6.4% of Mexico’s population (7.65 million people) reported having at least one disability in 2014 [3]. Early detection and diagnosis of a speech disorder, which can have a social, economic and educational impact, is important because the prognosis depends on the cause of the disorder and on timely treatment [4]. One of the first clues for detecting a speech disorder is the mispronunciation of characteristic phonemes; the phonemes that are more difficult to pronounce could help to identify the existence of a speech disorder in children from a certain age [5].
Information and Communication Technologies (ICT) have become a helpful tool and a natural part of the diagnosis and treatment of illnesses and conditions, such as speech disorders in children, in order to provide proper healthcare [6]. These resources are used in healthcare to transmit data, including databases or data obtained by retrieval, in support of diagnosis and treatment, both on site and remotely, by healthcare professionals, patients or their relatives.
Ontologies provide an explicit and well-defined structure for a clear and precise representation of a large amount of data related to a particular domain, in this case speech disorders, and can therefore become an instrument for diagnosis.
An ontology is proposed to organize and search information such as different disorders, characteristics of each disorder, therapy theory, taxonomy of speech disorders, problematic phonemes, and some other useful information for both the therapist and the patient, as well as relations between all of them.
One of the first steps in the development of this ontology is the compilation of a corpus, in this case of documents related to the domain of speech disorders. A corpus is a large collection of texts: a body of written or spoken material on which a linguistic analysis is based. Corpus analysis provides lexical, morphosyntactic, semantic and pragmatic information [7].
This document is organized as follows: Section 2 presents the state of the art through a discussion of works related to the subject; Section 3 discusses the corpus as a source of data and the taxonomies; Section 4 explains the development of the ontology, with subsections that detail important parts of its design, show its implementation in the Protégé software and describe the use of Protégé's logical reasoner for consistency tests; Section 5 mentions the future work that can be done; finally, Section 6 summarizes the current results and conclusions, followed by the references.
Previous work
In the field of speech and language, several works have been carried out using Information and Communication Technologies (ICT), focusing on specific ailments [8], on the automatic classification of pronunciation quality when some disorders are treated [9], or on an expert system for the initial evaluation of children with possible speech disorders [10]. There is also a so-called smart ICT ecosystem that includes the management of electronic medical records, standardized vocabularies, a knowledge database, ontologies for concepts within the domain of speech and language, and expert systems focused on supporting speech and language pathologists, physicians, students, patients and their relatives [11]. There are also tools based on ontologies and e-learning for training professionals in the field of speech disorders, which support future language therapists in their training process and in the development of practical abilities [12]. Regarding language therapies, a mobile application has been developed that integrates therapy activities for children and uses colloquial language and games from the state of Chiapas, México [13]. There are even ontology systems that cover various aspects of speech and language therapies, including initial assessment and patient profile, tests performed, a catalog of physicians and therapists, a list of disorders, speech and language fields, therapy, plans and follow-up exercises, among others; these use OpenEHR ontologies and constructions [14].
Regarding the construction of the corpus, the main classical techniques have not changed much, and the texts in a corpus must be in electronic format. Therefore, the fastest way to build a corpus is to collect data that is already digitized, or to rely mainly on transcribing audio recordings or documents into electronic format [15].
Previous works have some limitations: some are not designed for less specialized users, such as primary school teachers; in others, the taxonomies and ontologies focus on a single or very specific disorder, or address only the repetition part of therapy rather than the entire diagnostic process.
Methodology
Corpus building
To build a corpus, it is necessary to gather a large number of documents relevant to speech disorders by means of a web crawler. Once a representative number of documents has been obtained, they must be preprocessed in several steps that clean and standardize the data through techniques such as normalization and stemming. The main purpose of building this corpus is to obtain a data source for the extended taxonomy and to validate the classes and subclasses in the ontology.
The construction of a corpus is divided into two stages: design and implementation. A good practice in the design stage is to define what the corpus would ideally contain in terms of the quantity and type of language, and then to adjust the parameters as the building progresses, keeping a careful record of what is in the corpus so that it can be extended and amended later and so that others who use the corpus know what is in it [15].
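The cleaning step can be illustrated with a minimal sketch. It assumes English text and a deliberately naive suffix-stripping stemmer; the suffix list and the function names `normalize`, `stem` and `preprocess` are hypothetical and only stand in for whatever normalization and stemming algorithms are actually applied to the corpus.

```python
import re
import unicodedata

# Hypothetical suffix list for a toy stemmer; an established stemmer
# (for example Porter or Snowball) would normally be used instead.
SUFFIXES = ("ations", "ation", "ings", "ing", "es", "ed", "s")


def normalize(text: str) -> str:
    """Lowercase, strip accents and replace non-letter characters with spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z\s]+", " ", no_accents)


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Normalize a document and return its stemmed tokens."""
    return [stem(token) for token in normalize(text).split()]


tokens = preprocess("Speech disorders, in CHILDREN!")
```

With this toy stemmer, plural forms such as "disorders" collapse to "disorder", so that different surface forms of a term are counted together during retrieval.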
The main resource for collecting the information to build a corpus is a web crawler. At its core, a crawler is recursive: it retrieves the content of a page from a URL, searches that page for further URLs and retrieves those pages in turn, ad infinitum. To find documents relevant to the domain, and not just a list of links and random data contained in the seed page, a primary dictionary must be established at the beginning of the crawl, and each document is retrieved only if it contains at least one of the dictionary terms. This dictionary, or rather lexicon, is composed of some of the most significant words within the domain. The corpus can be complemented by including synonyms, hyponyms and hypernyms of the original terms in order to gather more documents [17]. The taxonomy mentioned in the following subsection, as well as the advice of experts in the field, was used to obtain the initial terms for the lexicon. Figure 1 shows a diagram of the steps to build and process the corpus.
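The lexicon-conditioned crawl described above can be sketched as follows. This is only an illustration under several assumptions: the traversal is written iteratively (breadth-first) with a page limit instead of unbounded recursion, the lexicon contains single words only, and the page fetcher is passed in as a function so that the example runs on a small in-memory "web" (`PAGES`, a hypothetical stand-in for real HTTP requests).

```python
import re
from collections import deque
from typing import Callable

LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seed: str, fetch: Callable[[str], str], lexicon: set[str],
          max_pages: int = 100) -> list[str]:
    """Crawl breadth-first from `seed`; keep URLs whose text has a lexicon term."""
    queue, seen, kept = deque([seed]), {seed}, []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        # Keep the document only if it contains at least one lexicon term.
        words = set(re.findall(r"[a-z]+", html.lower()))
        if words & lexicon:
            kept.append(url)
        # Follow every link that has not been queued before.
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


# Hypothetical in-memory pages used in place of real HTTP requests.
PAGES = {
    "a": 'Intro to dyslalia <a href="b">next</a> <a href="c">ads</a>',
    "b": 'Stuttering in children <a href="a">back</a>',
    "c": "Buy cheap shoes",
}

relevant = crawl("a", lambda url: PAGES.get(url, ""), {"dyslalia", "stuttering"})
```

In this toy run, pages "a" and "b" are kept and page "c" is visited but discarded, since it contains no lexicon term. A real crawler would also need URL resolution, politeness delays and error handling, which are omitted here.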

Fig. 1. Diagram of the steps to build and process a corpus.
Table 1 presents some summary data about the corpus.
Table 1. Summary data from the corpus.
Starting with a taxonomy of speech disorders proposed by Gallego and Gallardo, and retrieving data from the corpus, the next step is to expand the taxonomy to include all the speech disorders referred to in the retrieved documents [18].
The ontology focuses only on the speech disorders branch of this taxonomy, so that branch is expanded into more subcategories. The ontology also integrates other taxonomies on the etiologies of speech disorders, therapy strategies, the people involved in diagnosis and treatment, and the signs and symptoms; all this information is retrieved from the previously constructed corpus by means of IR algorithms. Once the five taxonomies are complete and integrated, the next step is the development of the ontology using those taxonomies as a basis [19].
Ontology development
First, the scenario in which the ontology is applied must be defined, followed by the generation of the so-called “competency questions” in natural language, whose objective is to determine the scope of the ontology. These questions and their corresponding answers are used to extract the main concepts of the ontology, as well as their relations, properties and axioms. The formality of this method makes it possible to transform informal scenarios into computational models. The design elements are the following: construction of the taxonomy for the knowledge base to be represented; attributes and relations within the classes; and rules and instance attributes.
The use of ontologies to represent a knowledge base within a given domain is intended to facilitate the understanding of that domain and to obtain better information on the subject [20]. The relevant information about speech disorders comprises: the classification of each disorder, with its own subclassifications, in order to correctly classify the disorder presented by each patient; the signs and symptoms produced by each disorder, which are the keys that the therapist should look for; the etiology, which could affect the course and the results of the therapy; and the different parts of the therapy, first an assessment strategy and then an intervention strategy directed by the therapist. Once the scenario for the area of competence of the ontology is defined, the set of taxonomies can be used to arrive at a definition of the ontology classes (see Fig. 2) and the relations between them; a series of questions that are expected to be answered by querying the ontology is also defined. A formal definition is made for the classes and their attributes, as well as for the description of the relations and axioms of the ontology.

Fig. 2. Main classes of the ontology.
The competency questions are an important part of the ontology design steps because they make it possible to define the domain and scope of the ontology.
The proposed ontology seeks answers to questions such as the following: What are the symptoms of a speech disorder? How many types of speech disorder are there? What is the therapy for a speech disorder?
The ontology knowledge base must be able to answer such questions. In this phase the questions are presented in natural language.
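To make the idea concrete, the following minimal sketch shows how a competency question in natural language could be restated as a pattern over subject–relation–object facts. Everything in it is hypothetical: the facts in `TRIPLES`, the relation names and the `match` helper are toy illustrations and are not taken from the ontology described in this work.

```python
Triple = tuple[str, str, str]

# Hypothetical toy facts; names are illustrative, not taken from the ontology.
TRIPLES: list[Triple] = [
    ("Dyslalia", "is_a", "Speech_disorder"),
    ("Dysphemia", "is_a", "Speech_disorder"),
    ("Dyslalia", "has_symptom", "Sound_substitution"),
    ("Dysphemia", "has_symptom", "Sound_repetition"),
    ("Dyslalia", "is_treated_with", "Articulation_therapy"),
]


def match(pattern, triples=TRIPLES):
    """Return the triples that fit `pattern`; None acts as a wildcard."""
    return [
        triple for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


# "What are the symptoms of a speech disorder?" (here, for Dyslalia)
symptoms = [obj for _, _, obj in match(("Dyslalia", "has_symptom", None))]

# "How many types of speech disorder are there?"
disorder_count = len(match((None, "is_a", "Speech_disorder")))
```

Each question fixes some positions of the pattern and leaves the others open, which is essentially what a formal query over the ontology does once the questions are formalized.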
Definition of classes
Five main entities were found after an analysis of the competency area scenario (see Fig. 2). A mixed strategy (top-down and bottom-up) was used to identify the main concepts [21]. Table 2 gives a brief description of each of the main classes in the ontology.
Table 2. Definition of classes.
Within the ontology, the classes are subject to restrictions. To describe these restrictions, it is first necessary to consider the relations between classes. Table 3 explains some of the relations identified; each has an inverse relation, which is also represented in the ontology.
Table 3. Relations between classes.
The axioms that define the rules for the ontology are established by the characteristics and existential constraints of the non-taxonomic relations. The object properties in Protégé are relations between two classes [22].
The characteristics of the relations can be seen as functions; in Protégé they are called property characteristics. Some of these characteristics are assigned to each object property according to the type of relation between classes that it represents. An example can be seen in Fig. 3, where property characteristics are assigned to the object property Evaluates_a and its inverse Is_Evaluated_by. Depending on the behavior of each property, the assigned characteristics are Functional, Inverse Functional (in the case of the inverse property), Asymmetric and Irreflexive.

Fig. 3. Property characteristics for relations.
The functional characteristic indicates that, for a given relation, at most one member of the range class may be related to a given member of the domain class through the property. If a property is inverse functional, its inverse property is functional. If a property P is asymmetric and relates an individual a to an individual b, then b cannot be related to a through P. Finally, if a property P is irreflexive, it relates an individual a to an individual b only where a and b are not the same. Figure 4 shows the graph that represents those relations.
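These four characteristics can be illustrated by treating a relation as a plain set of (subject, object) pairs. The sketch below is a closed-world check over asserted pairs, which is an assumption made for illustration: a reasoner such as those in Protégé uses the characteristics to draw inferences and detect contradictions rather than to scan a list. The individuals in `evaluates` are hypothetical.

```python
Pairs = set[tuple[str, str]]


def inverse(relation: Pairs) -> Pairs:
    """Swap subject and object in every pair."""
    return {(obj, subj) for subj, obj in relation}


def is_functional(relation: Pairs) -> bool:
    """Each subject is related to at most one object."""
    subjects = [subj for subj, _ in relation]
    return len(subjects) == len(set(subjects))


def is_inverse_functional(relation: Pairs) -> bool:
    """The inverse relation is functional."""
    return is_functional(inverse(relation))


def is_asymmetric(relation: Pairs) -> bool:
    """If a is related to b, then b is never related to a."""
    return all((obj, subj) not in relation for subj, obj in relation)


def is_irreflexive(relation: Pairs) -> bool:
    """No individual is related to itself."""
    return all(subj != obj for subj, obj in relation)


# Hypothetical assertions for an Evaluates_a-like relation.
evaluates: Pairs = {("therapist_1", "patient_1"), ("therapist_2", "patient_2")}
checks = [f(evaluates) for f in
          (is_functional, is_inverse_functional, is_asymmetric, is_irreflexive)]
```

Note that any asymmetric relation is also irreflexive, since a pair (a, a) would be its own reverse.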

Fig. 4. The Evaluates_a and Is_evaluated_by relations represented as a graph.
Other restrictions that help to describe and define classes are the quantification restrictions, namely the existential and universal restrictions. The quantification restrictions found in the ontology are mainly existential: an existential restriction describes the class of individuals that have at least one (some) relation along a specified property to an individual that is a member of a specific class.
To test the consistency of the proposed ontology, Protégé’s logical reasoners (HermiT and Pellet) are used with the probe class technique. This means deliberately adding an inconsistent class to check that the reasoner detects it. In this case, a new class was added: InconsistentDisorder, which is a subclass of both Articulation_disorder and Voice_and_Resonance_Disorder, two classes that are disjoint from each other.
After the reasoner is invoked to test the consistency of the added class, an error is shown because its superclasses are disjoint from each other. The result of this test is shown in Fig. 5.
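The logic behind the probe class test can be sketched without a reasoner: a class cannot have members if its superclasses include two classes declared disjoint. The sketch below is a simplified stand-in for what HermiT or Pellet do, not their actual algorithm; the parent class Speech_disorder in `SUBCLASS_OF` is an assumption added for illustration.

```python
# Class -> direct superclasses. Speech_disorder as a parent is an assumption.
SUBCLASS_OF = {
    "Articulation_disorder": {"Speech_disorder"},
    "Voice_and_Resonance_Disorder": {"Speech_disorder"},
    "InconsistentDisorder": {"Articulation_disorder", "Voice_and_Resonance_Disorder"},
}

# Pairs of classes declared disjoint.
DISJOINT = [{"Articulation_disorder", "Voice_and_Resonance_Disorder"}]


def ancestors(cls: str, hierarchy: dict[str, set[str]]) -> set[str]:
    """All direct and indirect superclasses of `cls`, including `cls` itself."""
    found, stack = {cls}, [cls]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def unsatisfiable(hierarchy: dict[str, set[str]], disjoint: list[set[str]]) -> list[str]:
    """Classes whose superclasses contain both members of a disjoint pair."""
    classes = set(hierarchy) | {p for parents in hierarchy.values() for p in parents}
    return sorted(
        cls for cls in classes
        if any(pair <= ancestors(cls, hierarchy) for pair in disjoint)
    )


flagged = unsatisfiable(SUBCLASS_OF, DISJOINT)
```

Only the probe class is flagged in this toy hierarchy; if it were not, the disjointness axiom or the hierarchy would need to be reviewed.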

Fig. 5. Probe class characteristics.
The previously created classes have only necessary conditions to describe them; such classes are called Primitive Classes. A necessary condition means that if something is a member of the class, then it must meet that condition. Necessary conditions cannot be used in reverse: it is not possible to say that if something fulfills the conditions, then it must be a member of the class. On the other hand, if an individual meets a set of sufficient conditions, that is enough to determine that the individual must be a member of the class. A class that has at least one set of necessary and sufficient conditions is known as a Defined Class. Some examples of defined classes in the ontology (the Patient and Therapist classes) are shown in Fig. 6 along with their sets of conditions [22].
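The difference can be sketched with a toy membership test. The definition used here, Therapist as "a Person that evaluates at least one Patient", is an assumption made for illustration and is not necessarily the set of conditions shown in Fig. 6; the individuals are hypothetical, and the check is closed-world over asserted facts, unlike a reasoner's inference.

```python
# Hypothetical individuals with asserted types and relations.
INDIVIDUALS = {
    "ana": {"types": {"Person"}, "Evaluates_a": {"luis"}},
    "luis": {"types": {"Person", "Patient"}},
    "eva": {"types": {"Person"}},
}


def some(individual: str, prop: str, cls: str, data: dict) -> bool:
    """Existential restriction: at least one `prop` filler is a member of `cls`."""
    fillers = data[individual].get(prop, ())
    return any(cls in data.get(f, {}).get("types", ()) for f in fillers)


def is_therapist(individual: str, data: dict) -> bool:
    """Assumed definition: Therapist = Person and (Evaluates_a some Patient)."""
    return ("Person" in data[individual]["types"]
            and some(individual, "Evaluates_a", "Patient", data))


inferred = sorted(name for name in INDIVIDUALS if is_therapist(name, INDIVIDUALS))
```

Because the conditions are both necessary and sufficient, "ana" is classified as a therapist although that type was never asserted. With a primitive class, the same conditions would only constrain individuals already declared as members.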

Fig. 6. Defined classes in the ontology.
With all these characteristics and constraints that define and describe the classes, the ontology can now be used to answer the competency questions and infer knowledge, and the information becomes available to a greater number of users, such as therapists and patients.
One way to give earlier access to speech therapy could be to classify the disorder using a model based on the type of pronunciation problem that the child presents. To this end, a new ontology is proposed, with classes based on a set of characteristic problematic phonemes for each speech disorder. In the Spanish language, these phonemes are classified into three main groups: voiced, voiceless and vowels. This information is contained in a new main class in the ontology named Problematic_Phoneme.
Creating a new ontology and integrating it with the previous ontology for the classification of speech disorders makes it possible to formulate new competency questions that outline a broader scope for the ontology [24]. Questions like the following help to shape new characteristics in this ontology: What is the set of phonemes linked to a specific disorder? Could a speech disorder be identified from a single problematic phoneme? If two different disorders have a common set of problematic phonemes, how could each disorder be distinguished?
An analysis of the scenario from the competency area showed that, in some cases, additional information could be used to better classify a suspected speech disorder. Thus, another class was added to the second ontology: Additional_Characteristic. It includes useful information for a better classification of the suspected speech disorder, such as abnormal speech speed or repetition of certain phonemes. Table 4 presents a definition of these two new classes, and Fig. 7 shows the graph of the taxonomy once integrated with some branches from the original ontology.
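The last two questions can be explored with a small sketch that ranks candidate disorders by how much their phoneme sets overlap with the phonemes a child mispronounces. The Jaccard index is used here as one possible overlap measure; it is an assumption of this sketch, not the classification method of the ontology. The disorder names and phoneme sets in `DISORDER_PHONEMES` are placeholders.

```python
# Hypothetical phoneme sets; real sets would come from the ontology instances.
DISORDER_PHONEMES = {
    "Disorder_A": {"r", "rr", "s"},
    "Disorder_B": {"r", "rr", "l"},
    "Disorder_C": {"k", "g"},
}


def candidates(observed: set[str], table=DISORDER_PHONEMES) -> list[tuple[str, float]]:
    """Rank disorders by Jaccard overlap with the observed problematic phonemes."""
    scores = []
    for name, phonemes in table.items():
        union = observed | phonemes
        score = len(observed & phonemes) / len(union) if union else 0.0
        if score > 0:
            scores.append((name, score))
    # Highest score first; ties are ordered by name for reproducibility.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


single = candidates({"r"})
several = candidates({"r", "rr", "s"})
```

In this toy table a single problematic phoneme leaves two disorders tied, while a larger set of observed phonemes separates them, which illustrates why one phoneme alone may not be enough to identify a disorder.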
Table 4. Definition of new classes.

Fig. 7. Graph representing the taxonomy for the proposed classification.
The inclusion of these classes introduces new non-taxonomic relations. Examples are Has_difficulty_with and its inverse Causes_difficulty_in: the former has the Speech_disorder class as its domain and the Problematic_phoneme class as its range, and the latter has them the other way around. Figure 8 shows these relations with the integrated classes.

Fig. 8. Non-taxonomic relations (object properties) between the new integrated classes and the Speech_Disorder class from the original ontology.
Each set of problematic phonemes was listed as instances of the subclasses Vowel, Voiced_phoneme and Voiceless_phoneme, which are subclasses of Problematic_phoneme; some of them can be seen in Fig. 9.

Fig. 9. Subclass Voiced_Phoneme and its instances.
This section presents the tests and queries carried out with Protégé 5.2 to obtain results and attempt to answer the competency questions.
The DL Query and SPARQL query languages [25] were used to develop the tests and queries that retrieve information from the ontology. Examples of queries answering competency questions are listed below.
Formalizing the competency questions and generating results is important because it is a way to evaluate the ontology itself. The above queries were intended to answer some of the competency questions; when a query is made this way, the ontology can directly produce satisfactory results.
As a result of this work, a consistent version of an integrated ontology was obtained. The taxonomy was built with five main classes, and two more were added for an additional classification; more than one hundred subclasses and several relations among individuals were identified; the existential restrictions and characteristics for the classes were set; and the definition of primitive and defined classes was completed. Together, these turn the ontology into a model for describing information which, when used to model the information of a structured environment, makes it possible to answer questions related to the area of competence. The ontology was populated with data collected by therapists working in public institutions, such as elementary schools. The instances of the Person class are the individuals represented by these data.
This ontology could be useful in situations in which a therapist needs theoretical information to make an accurate diagnosis, formulate a therapy plan or obtain a report with the characteristics, symptoms and signs present in a patient. Moreover, the ontology can be used to provide information to the patient and their family about a specific speech disorder. However, to use it in any scenario, other resources are needed that allow users without experience in query languages to access the information contained in the ontology and turn it into a useful instrument for diagnosis.
Ongoing work
The use of audio transcription software is proposed, together with an analysis of the transcribed data to detect a speech disorder and classify it using the current ontology as a knowledge base. A simple analysis of the transcribed text using metrics such as Levenshtein distance could detect the insertion, omission, substitution or repetition of speech sounds. Google Cloud Speech-to-Text has useful features such as real-time transcription, recognition of variation in the pronunciation of sounds and less word correction [26].
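As an illustration of the proposed metric, the sketch below computes the Levenshtein distance between an expected word and its transcription using the standard dynamic-programming formulation. Comparing orthographic characters rather than phonemes is an assumption made to keep the example short; the word pairs are illustrative only.

```python
def levenshtein(expected: str, produced: str) -> int:
    """Minimum number of insertions, omissions and substitutions between two strings."""
    # previous[j] holds the distance between the processed prefix of
    # `expected` and the first j characters of `produced`.
    previous = list(range(len(produced) + 1))
    for i, expected_char in enumerate(expected, start=1):
        current = [i]
        for j, produced_char in enumerate(produced, start=1):
            cost = 0 if expected_char == produced_char else 1
            current.append(min(
                previous[j] + 1,         # omission of expected_char
                current[j - 1] + 1,      # insertion of produced_char
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


substitution = levenshtein("rata", "lata")  # "r" pronounced as "l"
omission = levenshtein("plato", "pato")     # "l" left out
```

A distance of zero means the transcription matches the expected word; small non-zero distances point to a single altered sound, and the alignment behind the distance could then be inspected to tell which kind of error occurred.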
Acknowledgments
We would like to thank the Vicerrectoría de Docencia of the Benemérita Universidad Autónoma de Puebla for supporting this work.
