Abstract
At the beginning of his career and for about ten years, Pietro Torasso made fundamental contributions to speech and language processing for Italian. This article describes those early contributions and includes some of the lessons, scientific and otherwise, that many of his students from thirty years ago still carry with them.
Speech synthesis and recognition for Italian
The decade from the mid seventies to the mid eighties was a fundamental era for speech recognition and synthesis: in the United States, the work by Lawrence Rabiner at Bell Laboratories, and by Raj Reddy at Carnegie Mellon University, established the foundations of the field. Rabiner's work made the Hidden Markov Model (HMM) a standard tool for speech recognition [32, 33], while Reddy was involved in early projects sponsored by the Defense Advanced Research Projects Agency (DARPA) that produced some of the early speech recognition systems, such as Hearsay, Dragon, Harpy, and Sphinx I/II [18, 34]. Naturally, the focus of this research was English. However, Italian researchers were not far behind. The group CENS-CNR in Torino was the first to work on speech processing in Italy, both understanding and synthesis; its members included Renato De Mori, Piero Laface, Marco Mezzalama, Silvano Rivoira and Pietro Torasso.
Torasso’s “laurea” thesis from 1974 [47] focused on speech synthesis, specifically the production of nasal consonants in Italian. His speech synthesis work continued in collaboration with Lesmo and Mezzalama, producing a full text-to-speech system for Italian, which first converted a written text into its corresponding phonemic representation by means of a finite state transducer. The second step was to determine pause positions, phoneme durations and pitch contour on the basis of the punctuation marks and the accent marks present in the text. The idea of including prosodic features, such as pitch contours, in synthesized speech was extremely advanced at the time, as demonstrated by the fact that even today this is not a solved problem [17]. This is also the first paper in an international publication that we could track down in which Lesmo and Torasso appear as co-authors. As many of us witnessed, from the mid ’70s to the mid ’80s, Lesmo and Torasso’s collaboration on speech and language processing produced several fundamental papers on the computational treatment of the Italian language, to be discussed in Section 2.
Apart from the two papers on speech synthesis just discussed, early in his career Torasso devoted much work to speech understanding, always for Italian. In two initial papers [10, 11], the researchers (De Mori and Laface besides Torasso) proposed to use fuzzy set theory [21, 53] to model the uncertainty of the relationship between spectral segments (individual letters, pseudo-syllables or syllables) and their corresponding phonemic transcriptions. Fuzzy set theory was rendered via a fuzzy grammar. Subsequently, in a fruitful collaboration with Rivoira that resulted in a series of eight papers [35–42], the two researchers presented progressively more linguistically sophisticated models to recognize first isolated Italian words, and then Italian sentences composed of words pronounced in isolation.
Rivoira and Torasso approached the word/sentence recognition problem in a hierarchical fashion: this applied earlier in their work, to the mapping from acoustic signals to lexical recognition [35–37]; and later, to the syntactic-semantic models that further reduce, or add to, the recognition hypotheses [39, 42]. Their work married signal processing methods derived from speech processing proper with formal language descriptions via fuzzy automata (for the lexicon) [21, 53] and transition network grammars (for the sentences) [51].
In their earlier work [35–37], Rivoira and Torasso first preprocess spectral samples – output by band-pass filters – to extract global features, on the basis of which a word is segmented into vocalic, non-vocalic, silence or explosive tracts. The sequence of segments so obtained is parsed, via a fuzzy translation grammar, into a fuzzy set of sequences of broad classes of phonemes like vowels, stops (e.g., b, d, g), fricatives (e.g., f, v), sonorant (e.g., l, r) and non-sonorant consonants (see the footnote on the sonorant/non-sonorant distinction). These broad descriptions of the input word (a first-level description) are used by the lexical recognizer to prune the most unlikely candidates. A second-level description at the phonemic level of the retained hypotheses is then obtained by refining the segments of the first-level description whose stationary parts are roughly context-independent (as in the case of vowels, fricatives and stops). The refinement is performed with appropriate measurements, e.g., frequencies of formants for vowels, average and maximum energies for stops and fricatives. This phonemic description is then verified by means of the second-level fuzzy automata associated with those candidates (words) selected at the first level; each automaton represents both a prototype phonemic description of its word and the most frequent distortions of that description. The approach was evaluated on a set of fifty words pronounced by a single speaker, with a word rejection rate of 8% and misclassification close to zero. Rivoira and Torasso claimed that both first- and second-level fuzzy automata could potentially accept any variation in pronunciation, since they could be learned automatically via experiments with different speakers and could continuously be updated through grammatical inference, as described in [48]. Again ahead of their time, Rivoira and Torasso envisioned learning as interaction with a human teacher.
The teacher would check for the correspondence of a string of broad/phonemic symbols (produced by the classifier for an input word) to an acceptable broad/phonemic description of that word, and then would decide whether it should or should not be learned by the inference machine.
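To make the idea of a word automaton that accepts both a prototype description and its distortions more concrete, the following is a minimal sketch, not a reconstruction of the original system. It assumes the common max-min composition from fuzzy set theory (the papers' exact operators are not given here), and the states, symbols and membership grades are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# (state, symbol) -> {next_state: membership grade in [0, 1]}
Transitions = Dict[Tuple[str, str], Dict[str, float]]


def acceptance_grade(transitions: Transitions, start: str,
                     finals: Set[str], symbols: Iterable[str]) -> float:
    """Grade with which a fuzzy automaton accepts a symbol sequence.

    Assumption: a path is as strong as its weakest transition (min),
    and a state keeps the strongest path reaching it (max).
    """
    grades = {start: 1.0}
    for symbol in symbols:
        reached: Dict[str, float] = {}
        for state, grade in grades.items():
            for target, mu in transitions.get((state, symbol), {}).items():
                candidate = min(grade, mu)
                if candidate > reached.get(target, 0.0):
                    reached[target] = candidate
        grades = reached
    return max((g for s, g in grades.items() if s in finals), default=0.0)


# Hypothetical first-level automaton for a word whose broad description is
# fricative-vowel-stop-vowel; the stop is sometimes detected as silence.
WORD = {
    ("q0", "FRIC"): {"q1": 1.0},
    ("q1", "VOWEL"): {"q2": 1.0},
    ("q2", "STOP"): {"q3": 1.0},      # prototype description
    ("q2", "SILENCE"): {"q3": 0.6},   # frequent distortion, lower grade
    ("q3", "VOWEL"): {"q4": 1.0},
}

prototype = acceptance_grade(WORD, "q0", {"q4"},
                             ["FRIC", "VOWEL", "STOP", "VOWEL"])
distorted = acceptance_grade(WORD, "q0", {"q4"},
                             ["FRIC", "VOWEL", "SILENCE", "VOWEL"])
rejected = acceptance_grade(WORD, "q0", {"q4"}, ["VOWEL"])
```

In this toy setting, a candidate word whose automaton returns a grade of zero (or below some threshold) for the observed broad description would be pruned before the more expensive phonemic verification.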
In their later work [39, 42], Rivoira and Torasso considerably enriched their model in order to recognize full sentences (albeit composed of words pronounced in isolation). The domain they chose was the blocks world, made popular at the time in Artificial Intelligence (AI) by research on planning and Natural Language Understanding (NLU) [15, 50]. Sentences included definitions such as Definisci come beta la piramide nella cella uno cinque (Define as beta the pyramid in cell one five), commands such as Metti sul tavolo la sfera rossa (Put the red sphere on the table), and questions such as Dimmi la posizione di ogni cella libera (Tell me the position of each free cell). They note that the vocabulary size was 102 words and the average branching factor was 25. The speakers of those sentences were asked to leave short pauses between words in order to improve recognition.
The same acoustic-phonemic processor, based on fuzzy automata, was still used. A new major component was added to represent syntactic-semantic information, namely, a transition network grammar (TNG), equivalent to a push-down automaton, with subnetworks representing substructures (such as phrases). As the authors note [42, p.48]:
In our system non-terminal symbols and the related subnetworks represent syntactic-semantic constructions which may be present in different contexts, while the terminal symbols are the words in the lexicon.
In current parlance, Rivoira and Torasso’s TNGs would be described as semantic grammars, since the categories (non-terminals) encode semantic classes (category or CAT) such as source, destination, block; at the same time, the subnetworks expanding these non-terminals encode syntactic information as well. For example, the source subnetwork encodes two paths, one labeled as dal (from the, where the is marked as masculine) and the other as dalla (from the, where the is marked as feminine).
The acoustic-phonemic processor provides an ordered list of hypotheses for each word. A sentence is processed with a left-to-right, depth-first strategy. The presence of a specific word in a given context is predicted when the interpretation for a sentence is expanded to the right, by intersecting the words allowed by CAT C (top-down predictions) with the list of lexical hypotheses provided by the acoustic-phonemic processor (bottom-up predictions). The lexical hypothesis with the best score is then chosen, if one exists; otherwise, a dissimilarity score between the words predicted by CAT C and those predicted bottom-up is computed, and the word with the best dissimilarity score is then hypothesized (please refer to [42] for further details on how entirely missing words may be recovered, or extra words skipped). In a test set of 40 sentences, 39 were correctly recognized at the syntactic-semantic level, even though almost all sentences contained at least one word misclassification.
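The word-selection step just described can be illustrated with a small sketch. It is one possible reading of the strategy, not the procedure of [42]: the function names, the scores, and in particular the character-level dissimilarity measure are assumptions introduced here for illustration (the original system compared phonemic descriptions, not spellings).

```python
from typing import Callable, List, Optional, Set, Tuple


def char_dissimilarity(a: str, b: str) -> int:
    """Toy stand-in for a phonemic dissimilarity: mismatching positions
    plus the difference in length (lower means more similar)."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def choose_word(predicted: Set[str],
                hypotheses: List[Tuple[str, float]],
                dissimilarity: Callable[[str, str], float] = char_dissimilarity
                ) -> Optional[str]:
    """Pick the next word of the sentence.

    predicted:  words allowed by the current category (top-down).
    hypotheses: (word, score) pairs from the acoustic level (bottom-up),
                where a higher score is assumed to be better.
    """
    # Intersect top-down predictions with bottom-up hypotheses.
    matching = [(w, s) for w, s in hypotheses if w in predicted]
    if matching:
        return max(matching, key=lambda ws: ws[1])[0]
    if not predicted or not hypotheses:
        return None
    # Fallback: the predicted word closest to any bottom-up hypothesis.
    return min(sorted(predicted),
               key=lambda p: min(dissimilarity(p, h) for h, _ in hypotheses))


# The intersection is non-empty: "sul" scores highest acoustically,
# but only "dal" and "dalla" are allowed by the category.
picked = choose_word({"dal", "dalla"},
                     [("sul", 0.7), ("dal", 0.6), ("dalla", 0.3)])

# The intersection is empty: fall back on dissimilarity.
recovered = choose_word({"rossa", "verde"},
                        [("rosso", 0.8), ("mossa", 0.5)])
```

Here `picked` is "dal" and `recovered` is "rossa", showing how the grammar can override the best acoustic hypothesis and how a misrecognized word can still be mapped onto a word the grammar expects.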
Putting these papers in the context of the state-of-the-art of more than thirty years ago, the reader is struck by how innovative they were in their hierarchical approach to speech recognition. Whereas today we have access to both hardware and data that were not even imaginable at the time, hierarchical processing, and bottom-up hypotheses and top-down predictions co-constraining each other, are now touchstones of computer science software development, including in speech and language processing. But Rivoira and Torasso were already advocating and using them almost 40 years ago. It is of note that one of their papers, [39], which introduced the idea of syntax and semantic processing constraining speech recognition, is still being cited today: according to Google Scholar, it has been cited 58 times in US patent applications and disclosures since 2010.
Natural Language Understanding for Italian
The work we described so far pertained to speech processing, namely, to the computational treatment of spoken language. Originally, this area of research grew out of signal processing and information theory. At the same time, and in parallel, the area of NLU was taking shape: NLU was considered a subarea of Artificial Intelligence, and was grounded in formal languages research (the Chomsky hierarchy of formal languages, their corresponding grammar models, and their computational counterparts of finite state and push down automata). By and large, NLU was not concerned with speech processing, but assumed that input would be written/typed.
Some of the work Torasso did with Rivoira was already steeped in ideas stemming from NLU rather than from speech processing, for example the use of formal languages, and syntax and semantics co-constraining the hypotheses put forward by the speech recognizer. Not surprisingly then, some of the same themes emerge in the NLU work that became a major topic of the collaboration between Lesmo and Torasso in the 1980s. This collaboration was fueled by Leonardo’s passion for the study of natural language (his unaccomplished dream was a machine that could read and understand “I promessi sposi”, the novel by Alessandro Manzoni that belongs to the cultural background of every Italian); and by Piero’s passion for AI methods, which were considered outside the mainstream of Computer Science at the time (he was teaching and writing a textbook on the processing of non-numerical information, as AI was officially named in the course syllabus). After the mid-eighties, their collaboration ended, with Lesmo continuing his work on NLU for many more years, whereas Torasso devoted himself to knowledge representation and diagnostic reasoning, as discussed in other articles in this special issue.
The contributions of Lesmo and Torasso’s approach stemmed from two major (philosophical) interests. The first was their goal of developing an end-to-end system, a so-called “flexible” interface to data bases, thus promoting what would then become a major research area, human-computer interaction. In the late ’70s, this was a viable option for researchers in practical systems (as they were), while theoretical researchers were addressing specific formal systems or algorithms, and studying their expressive power and complexity, respectively. The second interest, which actually is in line with the first, was their focus on Italian, a language different from English, the only language studied in mainstream Computational Linguistics at the time.
This interest in a functioning system, and in one based on Italian, led to the development of a prototype, FIDO (Flexible Interface for Database Operations). FIDO allowed the user to use Italian to access the data stored in a relational database [22], and was based on a double operational level of syntax representation: the first level consisted of condition-action rules for building the syntactic tree from left to right (as the words come in) in a quasi-deterministic fashion, disambiguating as early as possible and maintaining a strongly connected structure at all times; the second level consisted of restructuring rules (so-called “natural changes”) to deal with parsing errors. These errors may happen in different situations, for example in case of garden path sentences (i.e., grammatically correct sentences that start in such a way that a reader’s most likely interpretation will be revealed to be incorrect), such as
“The horse raced past the barn fell.”, in English, or “La vecchia porta la sbarra.”, in Italian,
or incorrect disambiguations due to lack of information at the point of parsing ambiguity (such as, e.g., the resolution of the PP attachment at the underscore position in
“The boy ate the ice cream _ with chopped hazelnuts” VS. “The boy ate the ice cream _ with the silver spoon”
or the analogous phenomenon in Italian).
As these examples prove, FIDO was inspired by psycholinguistic ideas on parsing, especially those concerning determinism, brought to the fore in NLU by Marcus’ approach to parsing [29]: at the time, mainstream parsing was focused on the issue of parallel processing (as exemplified by the Earley and CKY approaches); determinism was an issue mostly dealt with in psycholinguistic studies.
The syntax representation was based on dependencies, a choice ahead of its time. In fact, in the early ’80s, mainstream syntax was based on constituencies, following the work published by Chomsky, from transformational grammar [4], to government and binding theory [5], which then evolved into the principles and parameters approach [6]. In particular, these frameworks were all founded upon a context-free grammar base, which was naturally suitable for a fixed order of constituents (the so-called Subject-Verb-Object, or SVO sequence, characteristic of English); to generate different constituent orders, such as the SOV of German or Japanese, the Chomskian approach postulated complex sets of parameterized rules. Since Italian features a freer order than English (though not as free as German or Japanese), Lesmo and Torasso turned their attention to dependency as opposed to constituency, represented formally by theories like Word Grammar [19] and Dependency Grammar [30], mostly applied to Slavic languages and free order languages in general. This choice, though driven by the free constituent order, was a precursor of current mainstream approaches to practical syntax (see, e.g., the Stanford Dependency approach [8, 9]).
Instead of evaluating the theoretical complexity of their parsing approach or the expressive power of the grammar implicitly underlying FIDO, Lesmo and Torasso decided to devote their efforts to analyzing realistic cases of linguistic input, namely ill-formed and coordinated sentences. These efforts were motivated by their interest in a system that would function in the real world (another concern ahead of its time). The analysis of ill-formed sentences was particularly aimed at sentences containing ellipses [24], e.g.,
“John loves Mary and Susy _ Fred”
where the underscore marks the point at which the elliptical reconstruction must take place, a methodology at odds with a word-based framework (the verb phrase Susy loves Fred, where the elided verb loves must be retrieved from context, needs to be reconstructed). They further extended their approach to the analysis of coordinated constructions in general [25], a topic that even today does not receive adequate treatment, notwithstanding the advances in quantitative parsing techniques in the large.
Finally, in their research on FIDO, they also focused on the relationship of parsing with semantics. In particular, starting from the notion of the interface to be developed, they addressed first, the issue of efficiently storing semantic information [23] and then, the issue of optimizing the resulting query, expressed in a relational algebra [27]. Finally, they promoted strict cooperation between syntax and semantics [26]. Syntactic knowledge was paired with semantic knowledge and decisions about ambiguous cases were weighted by taking into account the syntactic and the semantic structures produced by the left-to-right parsing process.
The FIDO system was fully implemented in FRANZ LISP on a VAX 11/730 computer; later it was expanded to the system GULL (General Understander of Likely Languages) and ported to PC and laptop. The latest developments are still downloadable at the URL http://www.di.unito.it/~gull/nlp-web/systems-nlpg.html
The interplay of interests between syntax and semantics, but also speech processing and knowledge representation, came full circle in a pair of theses, by Massimo Poesio and Paolo Baggia, that are among the last ones that Lesmo and Torasso oversaw together. Baggia and Poesio worked in Egidio Giachin and Claudio Rullent’s group at what was then CSELT (Centro Studi e Laboratori Telecomunicazioni). Their focus was on semantic interpretation for a spoken interface to a data base; the knowledge representation formalism they investigated was that of Conceptual Graphs [45, 46]. As discussed in [2, 31], the main computational task was to combine adjacent word hypotheses in order to create phrase hypotheses (PHs). In order to support real-time processing, the linguistic knowledge representation was meant above all to be efficient, namely, to have a low computational cost and to keep the number of PHs low. On the other hand, for the developer, it was necessary that the representation be easy to declare, interpret, and maintain. For this reason, syntax and semantics were kept separate as much as possible. At that time at CSELT, dependency grammars had been picked as the syntax representation of choice, beyond the project described in [2, 31]; Conceptual Graphs, which were fairly new at that point, had been suggested by Torasso as an appropriate semantic representation. Given those two representations, an off-line compiler would compute internal representations called Knowledge Sources (KSs), essentially expanding dependency grammar rules with semantic constraints derived from the Conceptual Graphs (for example, that the governor verb in a rule is send, and that its noun dependent must be a person). KSs, not the dependency grammar per se, were then used to perform parsing in service of the speech recognition system.
Pragmatic Processing. So far, we have discussed Torasso and collaborators’ approach to syntactic and semantic processing. However, a true computational treatment of language needs a third level of modeling pertaining to pragmatics, namely, to the meaning of language in context [28]: from a practical point of view, in NLU, pragmatic processing covers meaning interpretation and inferences that go beyond the literal meaning of a sentence, and depend on the larger context. Phenomena that are traditionally considered as pragmatic include: anaphora and referring expressions – typically, noun phrases ranging from pronouns (you, she, they in English, tu, lei, loro in Italian) to definite descriptions (the man in the red hat); speech acts [1, 44], namely, the speaker’s intentions underlying a sentence (compare can you run a marathon? to can you pass me the salt?: the first is a question pertaining to the hearer’s ability, the second is a command, couched as a polite request); and dialogue unfolding according to Grice’s cooperative principle and attendant maxims [16] (the cooperative principle: Make your contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged).
Again Lesmo and Torasso’s foresight is shown by the fact that, at a time when the NLU research community was focused on syntax and parsing to a very large extent, some of their research was in pragmatics. Several of the laurea theses they oversaw in the mid ’80s - including those by Marisa Mora, Gianfranco Lanza, Paolo Pogliano - dealt with determiners, anaphora, and reference resolution [13, 14].
Another topic in pragmatics pertained to what was then called over-answering [49], or generation of cooperative answers [20]; Di Eugenio’s initial foray into NLU was in this area, in collaboration with Nicoletta Bersia [3, 12]. It was observed that answers to language queries need to address the failed presuppositions underlying the original query; presuppositions are implicit beliefs that are taken to be true, such as the existence of course CS123 in the query Quanti studenti hanno passato CS123 il 21 giugno? (How many students passed CS123 on June 21st?). An answer such as Nessuno (Nobody), while true, is misleading to the user if the reason why no student passed CS123 on that date is that CS123 doesn’t exist, or was not offered that academic year, or it was but no exam for CS123 took place on June 21st. Bersia and Di Eugenio’s solution consisted of exploiting both levels of query representation that the FIDO system generated: the conceptual level closer to the language terms (today, we would say the conceptual query was expressed in terms of the entities and relationships in the domain ontology); and the logical query, i.e., the optimized DB query. On the basis of those two, they generated what they called condition graphs, which mirrored the nesting of conditions associated with the logical operations of select and join. These condition graphs were then used to understand which condition(s) from the query had failed with respect to the data stored in the DB; then, heuristics were used to generate appropriate answers in Italian.
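The CS123 example can be sketched in a few lines of code. This is only an illustration of the general idea of checking presuppositions from the outermost condition inwards; it flattens the nesting into an ordered list rather than building an actual condition graph, the answers are in English rather than Italian, and the data layout, function names and wording of the answers are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A condition pairs the explanation to give if it fails with a check.
Condition = Tuple[str, Callable[[], bool]]


def first_failed_condition(conditions: List[Condition]) -> Optional[str]:
    """Return the explanation of the outermost failed condition, if any."""
    for explanation, holds in conditions:
        if not holds():
            return explanation
    return None


def cooperative_answer(course: str, date: str, courses: Set[str],
                       exams: List[Dict[str, object]]) -> str:
    """Answer 'how many students passed <course> on <date>?' cooperatively."""
    sitting = [e for e in exams
               if e["course"] == course and e["date"] == date]
    conditions: List[Condition] = [
        (f"course {course} does not exist", lambda: course in courses),
        (f"no exam for {course} was held on {date}", lambda: bool(sitting)),
    ]
    failed = first_failed_condition(conditions)
    if failed is not None:
        # A bare "None" would be true but misleading: say what failed.
        return f"None, because {failed}."
    passed = sum(1 for e in sitting if e["passed"])
    return f"{passed} student(s) passed {course} on {date}."


# Hypothetical toy database.
COURSES = {"CS101"}
EXAMS = [
    {"course": "CS101", "date": "June 21", "student": "A", "passed": True},
    {"course": "CS101", "date": "June 21", "student": "B", "passed": True},
    {"course": "CS101", "date": "June 21", "student": "C", "passed": False},
]

no_course = cooperative_answer("CS123", "June 21", COURSES, EXAMS)
no_exam = cooperative_answer("CS101", "July 1", COURSES, EXAMS)
normal = cooperative_answer("CS101", "June 21", COURSES, EXAMS)
```

Checking the conditions in order of nesting means that the answer names the most basic presupposition that fails, rather than a consequence of it.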
Memories and life lessons
In conclusion, we hope that our overview of the research by Torasso and his collaborators, in particular Lesmo and Rivoira, in the decade from about 1975 to 1985, has managed to highlight how forward thinking, comprehensive and ambitious they were in their exploration of a variety of research themes in speech and language processing. Only recently, more than 30 years later, have some of those themes, for example conversational interfaces, resurfaced as viable technologies.
As a final note, we would like to conclude this paper by paying tribute to Torasso the man, beyond Torasso the researcher.
Many “laureati” from Informatica in Torino remember Piero Torasso as a teacher, not only for our profession, but for life. As a matter of fact, all of us who graduated in the ’80s were advised by both Lesmo and Torasso, hence we remember both with affection, and we consider both as our research “fathers”. As regards Piero specifically, we remember as unparalleled his devotion to his former students’ careers, including his advice on how to effectively promote one’s research via publishing and presentations. For example, after obtaining her laurea in 1985, Di Eugenio was a research associate in Torino from 1985 to 1987. She remembers many discussions with Piero regarding a career in research, advice from him on “concorsi”, and him expressing some regret when she decided she would stay in the USA for a PhD; and subsequent lively, yearly discussions for almost thirty years, on research and academic “politics” in Italy and in the USA. Barbara feels very lucky that in Fall 2016 she managed to spend a week visiting the Dipartimento di Informatica, with Piero as her host. Massimo Poesio credits Piero as the first one to put the idea of a PhD in his head. And one of Massimo’s early memories of Piero is that, the day Massimo was supposed to submit his laurea thesis, the librarian for Informatica discovered that one book Massimo had borrowed was missing (Massimo is certain he had returned it): Piero took his own copy of the book in question and gave it to Massimo, so that he could return it to the library. This kind gesture is emblematic of Piero’s nature, and his support for students, past and present. We will forever miss his kind words and support; likewise, we will always miss Leonardo’s keen eye for detail, and the ironic gaze he directed at the world, including the world of research.
Footnotes
The distinction between sonorant and non-sonorant applies at a different level of recognition - phonetic rather than phonemic. It was useful at the time: since the acoustic classifier did not have enough computational resources to fully distinguish between specific phonemes like m and n, it would classify them as sonorants, and pass them to the second level of analysis.
Acknowledgments
We thank Silvano Rivoira (Politecnico di Torino) for his feedback on the speech processing section; and Massimo Poesio (Queen Mary University of London), Nicoletta Bersia (Telecom Italia), Celeste Gallo (CSI Piemonte), Marisa Mora (Reply), for sharing memories and details on their “tesi di laurea”.
