A multimodal grammar of artificial intelligence: Measuring the gains and losses in generative AI

Abstract

This paper analyzes the scope of Artificial Intelligence (AI) from the perspective of a multimodal grammar. Its focal point is Generative AI, a technology that puts so-called Large Language Models to work. The first part of the paper analyzes Generative AI, based as it is on the statistical probability of one token (a word or part of a word) following another. If the relation of tokens is meaningful, this is circumstantial and no more, because its mechanisms of statistical analysis eschew any theory of meaning. This is the case not only for the written text that Generative AI leverages, but by extension image and multimodal forms of meaning that it can generate. The AI can only work with non-textual forms of meaning after applying language labels, and to that extent is captive not only to the limits of probabilistic statistics but the limits of written language as well. While acknowledging gains arising from the brute statistical power of Generative AI, in its second part the paper goes on to map what is lost in its statistical and text-bound approaches to multimodal meaning-making. Our measure of these gains and losses is guided by the concept of grammar, defined here as a theory of the elemental patterns of meaning in the world—not just written text and speech, but also image, space, object, body, and sound. Ironically, a good deal of what is lost by Generative AI is computable. The third and final part of the paper briefly discusses educational applications of Generative AI. Given both its power and intrinsic limitations, we have been experimenting with the application of Generative AI in educational settings and the ways it might be put to pedagogical use. How does a grammatical analysis help us to identify the scope of worthwhile application? 
Finally, if more of human experience is computable than can be captured in text-bound AI, how might it be possible at the level of code to create a synthesis in which grammatical and multimodal approaches complement Generative AI?

Keywords

Grammar; language technology; artificial intelligence; multimodal practices

Grammar and anti-grammar

On statistical language processing

“Whenever I fire a linguist, the system gets better.” Looking back over his career at a lifetime achievement award ceremony, Robert Mercer was quoting something his boss said to him in the 1980s when he was working at IBM (Mider, 2016). The team had been developing statistical language analysis. Mercer and his colleagues published the results of their seminal work in 1990 (Brown et al., 1990). The boss who may have uttered this apocryphal statement or something like it was Frederick Jelinek, a co-author of the IBM group’s paper (Jelinek, 2005).

In a retrospective overview of statistical language analysis, Mercer and one of his colleagues in the IBM group, Kenneth Church, characterized this as “a pragmatic approach” with an “emphasis on numerical evaluations” focusing on “broad though possibly superficial coverage of unrestricted text, rather than deep analysis” (Church and Mercer, 1993: 1). We’ve told the story of the origins of statistical language analysis at greater length elsewhere, a story that, as it happens in Mercer’s case, culminates in investment in Cambridge Analytica and the election of Donald Trump in 2016: no small feat for a language game (Kalantzis and Cope, 2020: 220-21, 235-40). Since the turn of the 2020s, the feats of statistical language analysis have become all the more awesome, particularly in the case of Generative AI. However, the basic mechanisms and principles developed by Mercer and colleagues remain the same. Because statistical language analysis works as a matter of principle without an explicit understanding of syntactic and semantic patterning in language, we want to characterize these mechanisms as an anti-grammar. (Fire the linguists!)

Generative AI is a combination of two technologies, a chatbot (Weizenbaum, 1966) and a corpus of text, the large language model (LLM). The chatbot allows a human user to query the text with a prompt. The result is a uniquely reconstituted digital artifact—in text, image, sound or multimodal combination. The elementary unit of analysis in the LLM is the token. Sometimes this is a word, but in the case of words with composite meanings, a token may be less than a word. Reduced to binary notation, the machine abstracts tokens in an identical, and for the purposes of its analysis, meaningless form. They are variable only in the quantified weight of their proximity to other tokens (vectors) in written textual corpora. If the relation of tokens is meaningful, this is circumstantial and no more, because the mechanisms of statistical analysis eschew any theory of meaning. In effect, the method is a kind of anti-grammar.
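
The idea that a token may be less than a word can be illustrated with a minimal sketch. It assumes a tiny, hand-picked vocabulary (VOCAB) and a greedy longest-match rule (tokenize); both are hypothetical and far simpler than the tokenizers of any actual LLM.

```python
# Toy greedy longest-match subword tokenizer (illustrative only).
# VOCAB is a hypothetical, hand-picked vocabulary, not that of any real model.
VOCAB = {"walk": 0, "ed": 1, "ing": 2, "dog": 3, "s": 4, "the": 5}

def tokenize(word):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces

tokens = tokenize("walked")
ids = [VOCAB[t] for t in tokens]
print(tokens, ids)
```

Once the pieces have been replaced by numbers, nothing in the identifiers themselves records that one piece names an action and the other marks past tense.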

On grammar

The grammar/anti-grammar distinction can help us to identify where Generative AI hits limits, in theory as well as practice. But first, if there is to be such a thing as an anti-grammar, what is grammar?

To answer this question, we draw our inspiration from a pair of linguists, Michael Halliday and Ruqaiya Hasan. “A grammar is a resource for meaning,” says Halliday, “the critical functioning semiotic by means of which we pursue our everyday life. It therefore embodies a theory of everyday life; otherwise it cannot function in this way… A grammar is a theory of human experience” (Halliday, 2000 [2002]: 369-70). Not only does grammar map the representation in language of the elemental patterns of meaning in everyday life. Language is integrally a part of life, or what Hasan calls “context.” Language is “functional,” it does something through a series of “bidirectional relations” with the material and social world (Hasan, 1999: 223).

Working closely with Halliday and Hasan, Gunther Kress drew inspiration from their functional or meaning-oriented (semiotic) approach. He extended their theory to encompass other forms of meaning that are frequently integrally connected with language, principally image (Kress and Van Leeuwen, 1996 [2021]). His interest was “motivated conjunctions of form and meaning” (Kress, 2009: 10). With Kress and others in the New London Group, we developed and applied these ideas to education under the programmatic agenda of “multiliteracies” (Cope and Kalantzis, 2023a; Kalantzis and Cope, 2023; New London Group, 1996).

Building on this work, we have mapped a theory of meaning that analyzes shared functions (reference, agency, structure, context, and interest) across all its forms (text, image, space, object, body, sound, and speech). We have called this a “transpositional grammar,” focusing on the constant and insistent movement across forms and functions (Cope and Kalantzis, 2020; Kalantzis and Cope, 2020). Like Kress, we note the profound differences between the forms of speech and written text, so radical in fact as to render the concept “language” problematic, as if text were a straightforward transliteration of speech. Speech is represented in sound, sequenced in time and aligned with embodied gesture, while written text is a variant of image, arrayed in space. They could hardly be more different (Kalantzis and Cope, 2022).

We define grammar as the patterning of meaning, the patterns designed into artifacts of meaning and the active, interested process of “making sense.” More than the linguist’s “syntax,” grammar is unavoidably linked to semantics. As such, it necessarily extends beyond language. Meanings are patterned through multimodal association across multiple forms of meaning. This is what statistical language analysis lacks, and by extension so does Generative AI. It has no idea, architecture, theory or mechanism of meaning, nor any way to take into account the human interest to mean.

In the second part of this paper, we will argue that computers can do some of this semantic work, or at least do considerably more than Generative AI will allow. But semantics was not the route taken by Mercer and his colleagues, nor their intellectual followers decades later. Though, as we will argue in the third part of the paper, it is never too late.

Incidentally, before we leave this question of the definition of grammar, that dominating grammarian of the twentieth century, Noam Chomsky, disagrees sharply with functional or semantic approaches like the one we are taking in this paper. “[G]rammar is autonomous and independent of meaning,” he says. Grammatical sentences can be meaningless, famously demonstrated in his nonsensical sentence “Colorless green ideas sleep furiously.” On the other hand, ungrammatical sentences can make sense (Chomsky, 1957: 15-17).

We would respond that Chomsky’s sentence might be meaningfully poetic, or the figment of a dream, or meaningful just as Chomsky’s famous sentence. We may not immediately understand a sentence, but between representation (meaning for oneself), communication (meaning for others), and interpretation (the sense one makes of another’s meaning), no sentence can be meaningless even if the meaning morphs through the transposition (Kalantzis and Cope, 2020: 47-63). As for incorrect grammar, in the interpretation, semantics comes to the rescue. Interpretation adds sense, righting incorrect grammar in its reception even to a point where it may prove not-so-incorrect after all, as is the case for instance in baby language and idiom.

Digging deeper into Chomsky’s elemental components of grammar, we find there is a profound semantic difference between a noun phrase (functioning to represent reference, in the terminology we have developed in our transpositional grammar), and a verb phrase (functioning to represent agency in our terminology—we explore these concepts in the second part of this paper). Even in his seminal Syntactic Structures, Chomsky eventually is forced to admit: “the fact that correspondences between formal and semantic features exist,... cannot be ignored” (Chomsky, 1957: 102). We would argue that semantics is integral to any grammar, and any apparent disjunctions must be read as parts of the system.

From the beginning of his grammar project, Chomsky scorned statistical and corpus-based approaches, and he still does. Of course, Chomsky is right in terms of the human processes of participating in meaning. We generate always-novel speech and text based on generalizable patterns of meaning: “[O]ne’s ability to produce and recognize grammatical utterances is not based on notions of statistical approximation.” And we can agree that “probabilistic models give no particular insight into some of the basic problems of syntactic structure” (Chomsky, 1957: 16-17).

Making sense of complexity

This is not just a theoretical argument. A child could never learn to speak just by being exposed to the billions of empirical word-to-word vectors like those recorded in Generative AI. As the child learns, grammar is a developing theory of the world, a compression mechanism with which to make sense of the enormous complexity and variability in the world. They learn for instance the difference between a particular thing (our dog, called “Apollo”), and a general thing (“dog” is his kind of animal).

If Generative AI is an empirical way to manage the vast complexity of human meaning, traces of which have been left in written text, human intelligence is by contrast theoretical, making sense of its complexity with a working theory of meaning that we characterize as a grammar. “Apollo” is a proper noun (which we would term an instance). “Dog” is a common noun (which we would term a concept). Grammar is not just in the words; it patterns the meanings of the world. It is a theory of the world.

Chomsky’s 1957 critique of statistical and corpus approaches was not just theoretical either. It was also about the impracticability of empirical, corpus-based approaches. “The difficulty of determining in any precise and realistic manner how many meanings several items may have in common, however, as well as the vastness of the undertaking, make the prospect for any such approach appear rather dubious” (Chomsky, 1957: 96).

With the development of Generative AI, Chomsky has been proved practically incorrect in this respect (Piantadosi, 2023). The brute force of today’s computing power has conquered the vastness of the empirical undertaking, to map the relation of every word in every written and digitally recorded text with every other nearby word. Clearly, the statistical approach of Generative AI works, albeit in its rough-and-ready kind of way. Nevertheless, even if it works this way in the machine, it is not possible that human meaning could work in a similarly empirical way. Grammar is the theoretical heuristic with which we make sense of the world.

Nevertheless, Chomsky has steadfastly maintained his distance, a paradigm away from statistical approaches to language. Six and a half decades after Syntactic Structures, he and his recent co-authors maintain that OpenAI’s ChatGPT is no more than “a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer” (Chomsky et al., 2023).

It is not so easy to dismiss Generative AI. It works effectively in response to prompts as hundreds of millions of users have seen, and to striking effect. However, as we will argue in the next sections of this paper, in its current form its mechanics have intrinsic limits. Once the limits have been bracketed, we can focus the application of Generative AI on what it currently does best, identifying ways in which it can be supplemented, complemented and put to good social and educational use.

But this is jumping ahead, foreshadowing parts of our argument that we will lay out later. Before we get that far, we will briefly mention several ironic twists in the prehistory of Generative AI—ironic because its foundational principles and processes were developed by some of the greatest linguists and grammarians.

Chomsky’s advisor while he was working on his doctorate at the University of Pennsylvania was Zellig Harris, author of the formidably rigorous Structural Linguistics (Harris, 1951). In 1954 Harris wrote a seminal paper analyzing the ways in which language has a “distributional structure” that does not require reference to meaning, though like Chomsky, he admits that grammar and meaning are at least strongly associated. A morpheme, the smallest meaningful element of language (roughly now, the token of Generative AI), does not have a single or central meaning. The range of its meanings can be discerned in its distributional structure.

Harris, however, meant semantic and grammatical distribution, not statistical distribution: “for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of use” (Harris, 1954: 156). Here we find the first use of the phrase “bag of words.” This has become one of the driving metaphors for the field of statistical language analysis now known as natural language processing, with its emphasis on statistical analysis while steadfastly ignoring grammar and meaning. This is more than ironic given the “not merely” qualifying Harris’ turn of phrase and the elaborately differentiated distributional structure that he goes on to analyze extensively in the 1954 paper and elsewhere in his linguistics.
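
In the statistical sense that the phrase later acquired, a “bag of words” is a count of words with their order thrown away. The following is a minimal sketch of that later usage, not of anything in Harris; the function name bag_of_words and the sample sentences are invented for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, discarding order and structure."""
    return Counter(sentence.lower().split())

a = bag_of_words("the dog walked the guard")
b = bag_of_words("the guard walked the dog")

# The two sentences reverse who walked whom, yet their bags are identical.
print(a == b)  # prints True
```

Who is acting and who is acted upon, the kind of distinction a grammar exists to capture, does not survive the count.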

Michael Halliday was also involved in the prehistory of the development of statistical approaches and artificial intelligence in his important work with Margaret Masterman and R.H. Richens in the 1950s in the Cambridge University Language Research Unit (Kalantzis and Cope, 2020: 100-101, 224-26). There, Halliday and colleagues developed a process of “stemming” in which words were broken into grammatically and semantically distinct chunks: “walk” and “ed” for “walked,” for example. Some stemming is syntactic as well as semantic, but some is just semantic: “infamous,” for instance, is not the opposite of famous (Richens and Halliday, 1957). These were precursors to the tokens that are the elemental units of Generative AI.

What generative artificial intelligence can do

So how does Generative AI work its apparent magic by calculation alone and without any mechanism for identifying or classifying meaning?

Generative AI is a next-word predictor, no more. Or that is what it appears to be; in fact it is even less, because before the words appear on our screens they have been reduced to tokens. Then there is a reduction to Unicode, the universal scripting system of digital text (Cope and Kalantzis, 2020: 23-25). What then follows is a reduction to binary notation in order to give a long and humanly unreadable name to what is by now a humanly unintelligible abstraction.

To be useful to humans, the machine must perform a mechanical process that we call transposition: source text > token > Unicode > binary notation (where all operations occur) > Unicode > token > generated text. The limits of the capacity of the machine to process artifacts of human meaning are the limits of what can be mechanically squeezed through binary notation via Unicode text. We’ll argue in the next section that, grammatically speaking, Generative AI does not even exploit the full potential of binary computing technology. More meaning can be squeezed through binary notation than Generative AI currently manages, including a wide range of multimodal meanings. Many more human meanings are not computable at all.
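
The chain from token to Unicode to binary notation and back can be sketched in a few lines. This is a minimal illustration only: it assumes UTF-8 as the byte encoding, and the function names to_binary and from_binary are invented for the example.

```python
def to_binary(text):
    """Encode text as UTF-8 bytes and write each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def from_binary(bits):
    """Reverse to_binary: parse the binary digits back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")

token = "ed"
code_points = [f"U+{ord(ch):04X}" for ch in token]  # Unicode code points
bits = to_binary(token)                             # binary notation

print(code_points)
print(bits)
print(from_binary(bits))
```

The round trip restores the characters exactly, but at the binary stage nothing distinguishes a past-tense ending from any other pair of bytes.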

Bound as they are to written text, statistical language approaches like Generative AI address the fact that different instances of the same word will have different meanings, and often the differences are subtle. “Walked the dog” is a different kind of walked from “walked to work,” or “the guard walked the prisoner to their cell.” In conventional syntactic terms, walk-ed is a verb in the simple past tense. But in a more finely grained grammar, the kind of agency of the walker is different in each case. By linking to surrounding text, Generative AI can detect endless different kinds of “walked,” or what James Gee calls “situational meaning” (Gee, 2023). In grammars of language, such subtleties are captured in transitivity (Halliday, 1968), case (Fillmore, 1968), or what Charles Fillmore calls “frame semantics” (Fillmore, 1976). Analysis of the range of such action types we have classified as “transactivity,” one of the dimensions of agency in our transpositional grammar (Cope and Kalantzis, 2020: 198-202).

By statistical weighting, every recorded token can be linked by a vector predicting the probability of its connection to surrounding tokens. In this way, thousands or tens of thousands of different grammatical designs can be attributed to the one word, depending on what is termed its “context window.” As a consequence, the LLMs that drive Generative AI can have a pseudo-vocabulary whose capacity to detect shades of meaning runs to billions of tokens. The AI, however, can have no notion of even the general shape of these shades of meaning, let alone the particularities that it may happen to have identified. Humans, by contrast, know the differences and hang on the subtleties: in the case of “walked,” those between different kinds of agency.
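
A count-based sketch may help to show how a word can be characterized purely by its neighbors. It is a deliberate simplification: actual LLMs learn dense numerical vectors rather than raw counts, and the function name cooccurrence, the window size, and the sample text are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def cooccurrence(tokens, window=2):
    """For each token, count the tokens found within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[token][tokens[j]] += 1
    return counts

corpus = "she walked the dog then he walked to work".split()
vectors = cooccurrence(corpus, window=2)

# The "meaning" of "walked" is here nothing but a tally of its neighbors.
print(vectors["walked"])
```

The tally separates the “walked” near “dog” from the “walked” near “work” only as a difference in numbers, without any notion of the different kinds of agency involved.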

The relation between the words written by Generative AI may prove to be semantic and syntactic as its practitioners have ably shown (Xu et al., 2015). However, the system’s capacity to generate meaningful text is not because it knows anything of meaning or grammar in a human, conceptual, or theoretical sense. Rather, it is because the text upon which it has made its calculations happens to have been patterned by humans by a process that we call grammatical. Generative AI has no way of “seeing” these designs or leveraging them for its own textual production. All it does is work with abstracted tokens that have been distinguished from each other by statistical probability in a vector space.

Not that it is always best for Generative AI to get the next-word prediction completely right. When it does, the result is boring robo-text. Generative AI allows a user to change the “temperature” (its terminology) or the predictability of the next word. Dialing up the temperature so that a slightly less predictable word is sometimes chosen makes for seemingly more interesting text.
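
In the formulation most commonly used, the temperature divides the raw scores of the candidate words before these are converted into probabilities, so that a higher temperature flattens the distribution and gives less likely words a better chance. The sketch below illustrates this under stated assumptions: the candidate words and their scores are invented, and the function name softmax_with_temperature is hypothetical.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words.
candidates = ["dog", "prisoner", "plank"]
scores = [3.0, 2.0, 0.5]

cool = softmax_with_temperature(scores, 0.5)
warm = softmax_with_temperature(scores, 2.0)
for word, p_cool, p_warm in zip(candidates, cool, warm):
    print(f"{word}: {p_cool:.2f} at T=0.5, {p_warm:.2f} at T=2.0")
```

At the lower temperature the most probable word dominates; at the higher one the unlikely candidates gain ground, which is the source of the less predictable text.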

An important question raised by Generative AI, relying as it does on the significance of next words and vector measures of proximity, is whether it works better for languages whose grammar is to a large extent represented in word order, with English the paradigmatic example. It is not clear whether it can work so well in highly inflected languages. The Australian language Anindilyakwa is a case in point, perhaps one of the most complex examples and one where, as a consequence of its phenomenal complexity, word order can be much more fluid than in English (Kalantzis and Cope, 2020: 42-47).

A note too about mathematics and computer code. We define written text as that which can be represented in Unicode, the universal character set for the connection of digital devices and the rendering of text in a wide range of media. Unicode is a typographical compendium of graphemes in every human language and the most widely used symbologies (Cope and Kalantzis, 2020: 23-25). Math and code are written in Unicode in the same way as all other written text. Like written text, they are fastidiously ordered in two-dimensional array on the visual plane of a page or screen (Cope and Kalantzis, 2023a). A handful of the graphemes in Unicode are phonemes or sound representations, but most are ideographs or concept representations. However, to the extent that the unit of analysis in Generative AI is the token and not the character, whether written language, math, or code, Generative AI takes all text to be ideographic. Generative AI is a technology that, to apply our grammatical terminology, analyzes the historical ordering of ideographs in order to select the appropriate next ideograph for a newly generated text. The unusually meticulous rigor of spatial arrangement and sequencing in math and code makes them especially amenable to automated manufacture in Generative AI.

Machine learning in generative artificial intelligence

Some have argued that Generative AI is a “stochastic parrot” of sorts, as if it just repeats textual formulae that it has been taught (Bender et al., 2021). But it is much more than a parrot, because it meets Kress’s and our requirements of “design”: the formulation of new, meaningful, and well-formed text with a selection and ordering of words that has mostly never been made before (Cope and Kalantzis, 2020: 68-72, 301-303; Kress, 2000). This is a remarkable technical achievement, and with it come new risks as well as new opportunities (Bommasani et al., 2022).

The process of this text production, however, is nothing like human design, which depends operationally on a theory of meaning that is at least implicit and sometimes explicit in human grammatical practice. This in all probability indicates that the rashly-promised day of “artificial general intelligence” is not so near as its enthusiasts claim. It may not even be possible in binary computing machines.

Here our case turns on an analysis of the manufacturing process of Generative AI. The machines have “read” almost every word ever published. This is because almost every published word has by now been digitized and in multiple languages, an estimated five billion words. After statistical analysis of this corpus, the AI can now predict the probability of the next word after any given word.

Of course, Generative AI can’t do this on-the-fly each time a user makes a query. It has to be pre-trained. Human training of the machine on such a scale (supervised machine learning) is impractical. So the machine has trained itself via a process called, with a nod to behaviorist psychology, reinforcement learning. The machine encounters a new sentence. It reaches a new word in the corpus, predicts the next word on the basis of its prior statistical analysis, then confirms its prediction when it is right or refines its statistics if it is wrong. Right is when the next word is as predicted, wrong when it is not. And so on, for billions of words. Technically, this is unsupervised machine learning or self-supervised learning (Manning et al., 2020; Zhai and Massung, 2016).
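
The predict-then-check loop can be imitated with simple counts. The sketch below is a deliberate simplification: it tabulates which word follows which, whereas actual LLMs adjust the weights of a neural network. The functions train and predict and the sample text are hypothetical.

```python
from collections import Counter, defaultdict

def train(tokens):
    """Read a token stream once, predicting each next token before seeing it."""
    following = defaultdict(Counter)  # following[w][x]: times x came after w
    correct = 0
    for current, actual in zip(tokens, tokens[1:]):
        seen = following[current]
        predicted = seen.most_common(1)[0][0] if seen else None
        if predicted == actual:
            correct += 1
        # The text itself supplies the right answer; update the counts either way.
        following[current][actual] += 1
    return following, correct

def predict(following, word):
    """Return the most frequent follower of `word`, or None if it is unseen."""
    seen = following.get(word)
    return seen.most_common(1)[0][0] if seen else None

text = "the dog walked . the dog barked . the dog walked home".split()
model, hits = train(text)
print(predict(model, "the"), predict(model, "dog"), hits)
```

No human labels anything here. The next word in the corpus serves as the answer key, which is why the procedure is described as self-supervised.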

The principles of behaviorism may have at times proved too mechanistic for human sensibilities. Behaviorism, however, has at last found a welcome home in today’s binary computing machines. As we have argued at length elsewhere, the “machine learning” that characterizes today’s artificial intelligence is so fundamentally different from human learning that the two barely deserve the same word (Cope and Kalantzis, 2023c).

Even with the unprecedented computing power of today’s server farms, the billions of tokens in the source texts and the billions of variations in their meaning are computationally unmanageable. Nor is it just the billions of tokens, because for every token there may be a statistically salient token to be found one, ten, fifty words away. Every extension of range expands the calculation exponentially.
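
A rough calculation illustrates the growth. Assuming, purely for the sake of illustration, a vocabulary of 50,000 tokens and a table that records every possible sequence of preceding tokens separately, the number of sequences multiplies by the vocabulary size with each additional token of range. The figure and the function name context_count are assumptions.

```python
# Number of distinct contexts if every sequence of n preceding tokens were
# tabulated separately. VOCABULARY_SIZE is an assumed round figure.
VOCABULARY_SIZE = 50_000

def context_count(n, vocabulary_size=VOCABULARY_SIZE):
    """Possible sequences of n tokens drawn from the vocabulary."""
    return vocabulary_size ** n

for n in (1, 2, 3, 5):
    print(f"{n} preceding tokens: {context_count(n):.2e} possible contexts")
```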

A breakthrough came in 2017 with a “transformer” that “pays attention” to patterns of statistical significance, reducing the amount of calculation potentially required to produce good results (Vaswani et al., 2017 [2023]). Again, this is statistical attention rather than grammatical attention of the kind humans apply to make sense of the world, the general shape of which we will outline in the next section. Even so, the cost of the machine time and the electricity for Generative Pre-trained Transformers (GPTs) like the ones developed by OpenAI might be $100 million or more per training session for each new version. The Generative AI developers are coy about costs, but in an interview the OpenAI CEO Sam Altman hinted at numbers of this order (Knight, 2023).
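
The core calculation of the transformer, scaled dot-product attention, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the three token vectors are invented, and they are used directly as queries, keys, and values, whereas an actual transformer first passes them through learned projections and stacks many such layers.

```python
import math

def softmax(values):
    """Convert scores into weights that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    width = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [sum(w * v[dim] for w, v in zip(weights, values))
                  for dim in range(width)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical token vectors, used as queries, keys and values alike.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights shows how much one token “attends” to each of the others. The weighting is entirely a matter of numerical similarity between vectors.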

Incidentally, there is nothing virtual or immaterial about what the digital computing machine does. Its work is performed electrically in integrated silicon circuits. In the era of the web the bulk of our useable computing happens in server farms. These are massive industrial infrastructures in every respect: their physical form; their labor relations; and their energy requirements. At this level, they are not so very unlike the printing factories in an earlier age of textual production. Textual meaning-production is still a classical manufacturing industry.

If the infrastructure of Generative AI is quintessentially industrial, learning machines are not so very new either. Reinforcement machine learning is as old as the servo-mechanism. This is a general technology of self-regulation. Perhaps the first such mechanism was the self-regulating “governor” in the Boulton and Watt steam engine of 1788, which constantly adjusted the amount of steam for a given speed of operation (Cope and Kalantzis, 2022b).

Generative AI is as simple in its elementary mechanics as the Boulton and Watt governor: it calculates the probability of co-occurrence involving one kind of thing, a token recorded in binary notation. These calculations are complex only in this sense: they are so painfully extended that there is no way of tracing the specific steps that produce a particular linguistic output. Forensic analysis of how a next word was generated is practically impossible because its provenance has been lost in the many layers of calculation, its “neural nets” (Rumelhart et al., 1986). As a consequence, Generative AI is a “black box.” This metaphor was created by cyberneticians in the 1950s to apply to machines the particularities of whose workings have become too convoluted for the specifics of their operation to be practically traceable (Ashby, 1956: 86).

Nevertheless, much more could be revealed about the operation of these black boxes than today’s Generative AI designers care to disclose. OpenAI’s so-called “GPT-4 Technical Report” is less a technical explanation than a sales document promoting the power of its software (OpenAI, 2023). So much also for the “open” in OpenAI now that the for-profit spin-off of the supposed not-for-profit is worth billions of dollars and has been spun into the orbit of Microsoft.

Multimodality in generative artificial intelligence

One last point before we move into a grammatical analysis of the intrinsic limits of Generative AI. Forms of meaning other than written text can also be processed through Generative AI—image, space, object, body, and sound—but only when mediated by written text.

Image, for instance, can only be processed in Generative AI by transposition through the medium of text: pixel count in source images > hand-crafted textual label (supervised machine learning) > token > Unicode > binary notation (where all operations occur) > generated pixel array with textual label. The seminal software experiment in the development of image recognition used the lightly structured WordNet database as its source lexicon (Deng et al., 2009).

The fully automated unsupervised machine learning of Generative AI that is used for written text is not possible for image and other non-textual forms of meaning. Hand labelling is required, or in our grammatical terms, a multimodal transposition across forms of meaning. Supervised machine learning is necessary for forms of meaning other than written text.

In the case of multimodal meaning, then, the limits of written text are added to the limits of Generative AI. For an era when text is being increasingly supplemented and even to some degree supplanted by other forms of meaning—Carey Jewitt calls this the “multimodal turn” (Jewitt, 2009: 4)—it is no small irony that as a technology Generative AI prioritizes and privileges written text, albeit in the case of forms of meaning other than written text, by means of multimodal transposition.

Nevertheless, we can see how effectively Generative AI works with other forms of meaning when provided textual prompts, image particularly. Computer scientists have built “meta-transformers” that operate across a wide range of digital sources including not only 2D images, but 3D point clouds, audio, video, time series, tabular data, and more (Zhang et al., 2023). This, however, requires text mediation, and the textual labels have to be applied by hand via processes of supervised machine learning (Li et al., 2023).

As for speech, most of the source text used by Generative AI was natively created as written text. We have argued elsewhere that the grammar of speech is in some fundamental respects quite different from that of text (Kalantzis and Cope, 2022). Generative AI is more than anything a technology of writing, not speech. There may be some transliterations of speech in the Large Language Model corpora, but these are rarely sourced from spontaneous speech. Rather they come from transcriptions of hybrid, text-like speech, such as lectures, readings, or narrative scripts. As a consequence, prosody, dialect, gesticulation, embodied context, redundancy, hesitation, circumlocution, and other meaningful and distinctive features of speech are largely lost. In one way or another, the grammar of written text dominates Generative AI.

The scope and limits of generative artificial intelligence

Transpositional grammar as a measure of the scope of meaning

Now we will map the scope of Generative AI against the measure of a framework for analyzing patterns of meaning that we have proposed. We have called this a transpositional grammar (Cope and Kalantzis, 2020; Kalantzis and Cope, 2020). We argue that such a grammar is particularly apt for our era of ubiquitous, multimodal digital media.

First, a brief theoretical background, connecting transpositional grammar to its roots in Halliday and Hasan’s systemic-functional grammar (Halliday and Hasan, 1985). We identify five macro meaning functions for the analysis of any act or artifact of meaning:

• Reference What is this about?

• Agency Who or what is doing this?

• Structure How does this hang together?

• Context When/where is this connected? and

• Interest Why, or what’s this for?

Every moment of meaning, whatever its form, enacts all five functions. All meaning, we propose, is answerable to these questions.

Our concept of reference roughly aligns with Halliday and Hasan’s ideational metafunction, agency with their interpersonal metafunction, structure with their notions of mode and cohesion, context with their pragmatics of situation, and interest with their idea that all representations of meaning are motivated and on this basis functional (Halliday and Hasan, 1985; Halliday and Matthiessen, 2014).

The transposition part of the theory captures the constant movement of meanings. The five elemental functions are always present in all meaning. On this dimension, transposition is a shift in our attention between one aspect of meaning and another. While Generative AI pays statistical attention, humans pay grammatical attention.

In the terminology we have developed for transpositional grammar, meaning functions can be expressed in different forms and multimodal combinations of forms:

• Text - phonemes and ideographs, representable as graphemes across two-dimensional array in the universal character set, Unicode

• Image - realistic or abstract representations in two-dimensional array, laid out as pixels in digital media

• Space - meanings represented across three-dimensional spatial co-ordinates

• Object - meanings in tangible, material things

• Body - meanings embodied in gesture, demeanor, bodily appearances, body ornament

• Sound - audible meanings, directed or ambient, and

• Speech - spontaneous unfolding of language through audible voice.

On this form dimension, meanings can be expressed in any form or multimodal combination of forms. We can show a picture of a mountain, or read the word “mountain,” or point to the mountain right there in front of us. It’s more or less the same mountain in each case. The meanings may substitute for each other or complement each other according to the different affordances or scope for meaning in each form. Transposition on this dimension captures the movement across and between forms of meaning.

These forms of meaning map to the embodied materiality of the human sensorium. Text and image can be exclusively a matter of sight, arrayed in the simultaneity of placement. Sound and speech can be purely a matter of sound, ordered in time. This makes text and speech maximally different in the materiality of embodied human experience. Text and image are closely connected in the two-dimensional spatial array, frequently extended on a third dimension in space and object. Speech and sound are integrally connected by their sequencing in time, and closely aligned to body. Incidentally, the sheer distance and difficulty of the text < > speech transposition makes literacy one of the hardest things modern humans have to learn, requiring years of explicit teaching to reach its higher academic forms (Kalantzis et al., 2012 [2016]). This throws into question the claims of the proponents of phonics-based approaches to literacy who focus on drilling sound-letter transliterations.

We’ll take each of the meaning functions now, analyzing what is computable and what is not. Among the parts that are computable, there is much that Generative AI does not compute, and within its current technological capacities, cannot compute.

Reference

An analysis applying the grammar of reference shows the ways in which digital media extend human meaning capacities, though Generative AI fails to exploit these potentials fully.

In a traditional grammar of language, proper nouns point to particular things (take our dog, “Apollo”), and common nouns direct our attention to their generality (he is a “dog”). In image, a photograph will capture Apollo’s singularity, and a dog icon on a sign will capture the generality of dogs. For a grammar that will address this distinction multimodally, we highlight a distinction between meanings which represent an instance (one-ness, singularity), and those which represent a concept (multiple-ness, plurality, generality). Plurals and definite or indefinite articles also make this grammatical distinction, albeit at times in a haphazard way (Cope and Kalantzis, 2023d).

This is not a distinction that Generative AI can make, except by the happenstance of word collocations. If it has a semantics, at most this is latent (Landauer et al., 2007). In engineering terms, this means that it does not have a semantics at all. It’s just that sometimes—not always and unreliably—a semantic distinction that can be found in source texts may come out in generated text or image.

However, digital text makes this distinction frequently, and often quite clearly. The labels on data entry forms work on a distinction between concept (the field name) and instance (the data entered). Information in databases is stored in tables, where characteristically the column and row heads reference concepts and the cells reference instances. Textual markup such as HTML and XML is a play between concepts in tags and the instances they tag. With this grammatical architecture, digital media have massively expanded and stabilized our human capacity for reference.
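The concept/instance relation built into forms, tables, and markup can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not a description of any particular system: the `DogRecord` type, its field names, and the sample values are hypothetical. The field names stand for concepts, and the values entered stand for instances.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class DogRecord:
    # Field names play the role of concepts (hypothetical schema).
    name: str
    species: str
    chip_id: str


# One row of a table: each value is an instance of the concept named by its field.
apollo = DogRecord(name="Apollo", species="dog", chip_id="EXAMPLE-0001")

# Column heads (concepts) and cell contents (instances) for this row.
concepts = [f.name for f in fields(DogRecord)]
instances = [getattr(apollo, concept) for concept in concepts]

# The same relation in markup: the tag is the concept, the enclosed text the instance.
element = ET.Element("dog")
for concept, instance in zip(concepts, instances):
    ET.SubElement(element, concept).text = instance
markup = ET.tostring(element, encoding="unicode")
```

Here `markup` holds an XML string in which each tag names a concept and wraps its instance. The structure, rather than any statistical collocation of words, is what keeps "Apollo" (an instance) distinct from "dog" (a concept).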

In the written language from which Generative AI draws, there are endless Apollos—gods, rockets, dogs and more. In our grammatical terms, the proper nouns of written text instantiate but not very well. A realistic image does a better job. And if our Apollo gets lost, nothing is as definitive as the multimodal triangulation of visible features of his body in a photograph, the “Hi, my name is Apollo” name tag with our phone number hanging from his collar, the council registration tag, the identifying chip in his leg, and the GPS tracker on his collar.

One of the remarkable achievements of the digital era is the stable “proper nouning” of so much of the world. In our grammatical terms, we would call this multimodal instantiating: the product numbers which make our phones hardly worth stealing; the recorded images of our faces that mean we can walk onto a plane without showing our boarding pass or passport; the web pages whose URLs are always unique despite the trillions of such pages; and the objects that speak their own alphanumeric names in the binary code of the internet of things. With digital media, we have hugely extended the capacities of natural language to instantiate (Cope and Kalantzis, 2020: 83-99), though Generative AI has not. It works in a flat world of tokens and proximity vectors in recorded text, oblivious of the distinction that our computer systems elsewhere now make so reliably.

If the world were just the collocation of instances, however, the complexity of life would be overwhelming. This is how young children start their lives, immersed in the complex experiential association of instances. As they grow and learn, they increasingly conceptualize, abstracting and classifying instances with concepts. Vygotsky traces this development in rich detail in his socio-material psychology (Vygotsky, 1978). “Apollo” and “dog” is a distinction that children make early in their lives, and on this foundation, they later come to be able to make the more abstract distinctions of science, history, and language even. We manage the complexity of the world by the grammatical relation of instance and concept. Generative AI can’t make this distinction.

Elsewhere however, computer systems can and do make this distinction rigorously, at the highest level by the application of data models and ontologies, and the establishment of principles of interoperability between ontologies (Cope and Kalantzis, 2020: 301-28). The International Classification of Diseases maps our bodies and their maladies in medical information systems. Product numbering systems classify kinds of saleable thing, and this gives order to our shopping experiences. GeoNames classifies particular places into kinds of place. In our digital lives we’d literally be lost without the multimodal grammar of conceptualization and its mechanization in computer systems. The transpositions between instances and concepts are much of the time unambiguous.

Transpositional grammar does not attempt to regulate meaning states or stabilize them by their differentiation from each other. This is the occupational hazard of traditional grammars. Rather, it traces movements. So, there are no concepts that cannot be instantiated and no instances that cannot be conceptualized. Every concept is begging to be instantiated and every instance begging to be conceptualized. The difference between instantiating and conceptualizing is a matter of attention to the same thing—grammatical attention. Meaning is architected through a number of these fundamental unities-in-tension.

Today’s computer systems are designed to manage the transpositions across endless instances classified by finely differentiated data structures. This is how they extend our natural capacities to mean accurately (in the case of instances) and powerfully (in terms of conceptual scope, rigor, and consistency).

Generative AI, however, only works with an undifferentiated “bag of words.” Its attention to these is no more than to the statistical collocation of combinations of characters expressed in the written text of Unicode. This is the grammatical reason why Generative AI makes stuff up or “hallucinates,” a metaphor coined by Baker and Kanade for AI imaging (Baker and Kanade, 2000) and now extended to all kinds of misapprehensions in AI (Klein, 2023). Perhaps a more philosophically apt concept for Generative AI is one introduced into our theoretical lexicon by the Princeton University philosopher Harry Frankfurt—it “bullshits” (Frankfurt, 2005). Generative AI bullshits about specifics because, given its particular technology of textual form, it can’t know the difference between a generalizable concept and a triangulated empirical instance. This is why it can’t distinguish fact from fake.

Both instances and concepts can be specified according to their characteristics—the adjectives that qualify both proper and common nouns in old-school grammars. In our multimodal grammar we call these properties or descriptive characteristics. For some kinds of property, computers can do this today with a super-human accuracy, far more powerful than the best of adjectives. This referencing of properties may even come from sensors embedded in the things that they are referencing or located in proximity to them—temperatures, humidity, colors, sizes, weights, and such like, as well as a wide range of biometric indicators when the sensors are attached or directed to our bodies.

There are many properties, of course, that are not directly computable. In these cases, we are reduced to the vagueness of language—the properties of wine expressed by verbal metaphor and expert obscurantism in wine reviews, for instance (Lehrer, 1975). Touch can be named across the subtle variations of embodied human experience, but touch-sensitive computing is reduced to a limited repertoire of swiping and pressing.

Nevertheless, many properties can be specified with greater precision than bodily sensations expressed in natural language. By comparison, Generative AI can only work on what happens to have been added to the decontextualized written record. This is a very limited subset of computer capacities to record and analyze properties and to locate these accurately in context.

Meaning has never been co-extensive with language, even if Wittgenstein and others might have had us come to this conclusion. “The limits of my language mean the limits of my world,” he famously said (Wittgenstein, 1922 [1933]: 5.6). Meaning has always been immanent in the social-material world, represented in the endless transpositions between image, object, space, body, and sound as well as text and speech. Meaning is not (just) in language. It is pervasive in sensuous, material meanings and our embodied relation to these. Even if written text can for the purposes of computation be reduced to Unicode in two-dimensional array and its binary substrate, lovers of text may nevertheless find their attention drawn to the look of typography, the smell of a book, the feel of a bookstore, the aesthetic of scribing, or the pleasure of reading and writing.

Digital media today manage the multimodal transpositions between instance, concept, and properties with unnatural accuracy. On the measure of the grammar of reference, computers now manage a wide range of transpositions across all forms of meaning. Objects speak their names from QR codes or in augmented reality. Our bodily features speak to our unambiguous identification as persons. Sensors reference instances and concepts by their properties.

The power of digital representation and communication of meaning extends far beyond what can be captured from the accessible corpus of written text. If this is the era of multimodal meaning in which text no longer holds a privileged position, Generative AI anachronistically limits itself to the discursive sequencing of written text. As we are arguing here, much more is computable.

The world of ubiquitous computing requires that we move beyond the old, language-centered theories of meaning. It is the grammar built into computing architectures that in part makes them so powerful, a grammar that Generative AI eschews. Or in Judea Pearl’s computer science terminology, the oft-neglected data models require as much credit for the practical effectiveness of AI as the over-emphasized algorithm (Pearl, 2018).

Agency

We can choose to attend to things (reference), or we can attend to the drives that motivate their having-become, their continuing-to-become, and their likely transformation into new becoming. This kind of attention we call agency. Between reference and agency, there is not a fixed contrast, but constant movement. Reference and agency are always transposable.

By proposing the idea of transposition, we mean to argue against a long tradition of grammatical analysis which positions kinds of meaning in a stable system of contrasts, along the lines proposed by de Saussure (De Saussure, 1916 [1983]). When we characterize grammar as a series of transpositions, we recognize that the elements are always on the move or impatiently waiting to transform themselves into something quite different. We can reference Google, but when we Google something, Googling is a kind of agency. Nouns (reference) impatiently wait to become verbs (agency), and verbs nouns. The difference between the grammatical activity of reference and that of agency is a matter of attention, of choosing to see things one way, then another.

In the Hallidayan system, the process of turning verbs into nouns is called nominalization or grammatical metaphor. This has become a particular habit of modern science and technology, where in technical discourses actions (agency) are represented as concluding states (reference). It is also a habit of digital information systems. Data models and ontologies reference states, and the agentive connections that created these states are often implied. The diagramming tools used by software developers such as Unified Modeling Language focally attend to states in their nodes, and the actions that produced these states are secondarily indicated in connecting arrows. Action is implied in the move from state to state (Cope and Kalantzis, 2020: 122-25).

Philosophers have long been wary of modernity for its objectifications, for its habit of artificially freezing fluid reality into its thingness—the hazardous tendency of empiricists and rationalizers, says Henri Bergson (Bergson, 1903 [1912]: 30). Or, in Alfred North Whitehead’s words, “we tend to analyze the world in terms of static categories” when, in the endless flux of the world, “all things flow” (Whitehead, 1928 [1978]: 208).

Notwithstanding these modern proclivities to privilege reference, agency can never be erased. In a transpositional grammar, reference and agency are only momentarily observable as such. This is much like quantum mechanics, where realities that are transitory are momentarily framed as states only by the act of observation (Kalantzis and Cope, 2020: 143-45). For a transpositional grammar, we characterize this as a shift in attention. A tangential comment: “quantum computing,” so-called, still works with the old zero and one binaries. In a transpositional grammar we want to adopt the idea from quantum physics that binaries oversimplify the tensions created by the uneasy immanence of one kind of meaning in another, its impatient begging to transpose (Cope and Kalantzis, 2020: 129-30).

Applying the reference/agency distinction to a multimodal example, still image also prioritizes reference by dint of its affordances. Things captured in an instant are laid out in two-dimensional spatial array. We are forced to imply agency through devices such as vectors and vanishing points. Specialized visual forms such as flow diagrams can show agency more explicitly, though even here the question remains as to the extent to which nodes are prioritized in the representation. Moving image, working objects, and spatially framed wayfinding offer more scope for the representation of agency.

Computers become agents themselves when programmed into sequences of machine-human interaction. The ATM machine is a case in point, linking referenced things (your person, your bank account, and money dispensed) in a chain of action (withdrawing cash). Claude Shannon’s insight for the mechanization of agency was the translation into binary notation of logical decision paths based on Boolean algebra (Shannon, 1938). Electrical switching translates into a framework of operators for “and” (a conjunction, for instance, when two or more instances share a concept), “or” (a disjunction, for instance, where an instance can be classified by one concept only if it is not classified by another), or “not” (a negation, for instance, where an instance is not present in a concept). The operators become the basis for chains of human-computer interaction. Building on this mechanism, the grammatical reference < > agency dynamic is systematically built into the machine and its “user interface” (Cope and Kalantzis, 2023d).
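The way Boolean operators chain referenced things into a path of action can be illustrated with a minimal Python sketch. The function `may_dispense`, its parameters, and the example sets are hypothetical, introduced only for illustration; they do not describe how any actual ATM is programmed. Note that Python's `or` is inclusive: it holds when either or both conditions hold.

```python
def may_dispense(card_valid: bool, pin_correct: bool, amount: int, balance: int) -> bool:
    """Hypothetical decision path: dispense cash only if every condition holds."""
    overdrawn = amount > balance
    # "and" joins the conditions; "not" rules out the overdrawn case.
    return card_valid and pin_correct and not overdrawn


def is_classified(instance: str, concept: set) -> bool:
    """An instance is classified by a concept if it is a member of that set."""
    return instance in concept


# Hypothetical concepts, each modeled as a set of instances.
dogs = {"Apollo", "Rex"}
pets = {"Apollo", "Tom"}

# Conjunction: the instance falls under both concepts.
in_both = is_classified("Apollo", dogs) and is_classified("Apollo", pets)
# Disjunction: the instance falls under at least one of the concepts.
in_either = is_classified("Tom", dogs) or is_classified("Tom", pets)
# Negation: the instance does not fall under the concept.
not_a_dog = not is_classified("Tom", dogs)
```

Each result here is fully determined by the explicit conditions. Nothing depends on how often the words involved happen to appear near each other in a corpus.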

This is not the case with Generative AI, which can only capture the reference/agency grammatical distinction unreliably through circumstantial word ordering. Generative AI might be used to support the work of autonomous or semi-autonomous agents (Liu et al., 2023), but not with the reliability of explicit programming by humans.

Structure

Meanings hang together. They have order. They have more-or-less definable boundaries at different scales—a picture or a gallery; a paragraph or a book; a gesture or action sequence; a room or a building. Halliday and Hasan characterize the result of making things hang together as the cohesion of a given mode (Halliday and Hasan, 1985). We call this structure, both as an immanent ordering (a noun: the order that might be found in an artifact of meaning) and the job of creating order (a verb: making a meaning as an ordering or design process).

Digital technologies offer us media with which to structure meaning. Unicode allows humans to arrange text of all kinds in every language and any widely used symbology, from phonemes to ideographs such as emojis, mathematical symbols, computer code, and more. Pixel arrays allow us to represent a two-dimensional image. Space and object can be represented in virtual reality or 3D printing. Sound and speech can be decomposed and recomposed from the elementary constituents of sound. Embodied gesture and movement can be captured in video.

In the era of analogue manufacture of meaning, letterpress printing, photography, sound recording, and other media of production and transmission were quite different kinds of technology. For reasons in part rooted in the materiality of their structuring, in the analogue era the various media of meaning-making tended to go their own relatively separate ways.

Now that elementary manufacturing technology is binary notation, multimodal meanings are easier to assemble and more accessible to non-professionals. The shared substrate of digital media accounts in part for the pervasive multimodality of today’s means of production and consumption of meaning.

Digitization transforms the breadth and capacity of humans in the production and distribution of meaning. With its 149,186 characters, Unicode is far more than any human could pull from memory as they read or write. It is far more than they can themselves practically use, but nevertheless now within the realm of computable possibility. The JPEG image format and its interoperable variants can, in Red–Green–Blue (RGB) combination and 24-bit encoding, represent 16,777,216 color alternatives, and that’s just for one pixel. Megapixel counts on most screens make visual contrasts far finer than human visual capacities. Sound is recorded as a numbered binary at a sampling rate of 44.1 kHz, which captures frequencies up to about 22 kHz, beyond the range of human hearing. Digital media today far exceed natural human capacities.
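The color figure follows directly from the arithmetic of 24-bit encoding: three channels of 8 bits each give 2 to the power of 24 combinations. A minimal Python sketch makes the calculation explicit. The function names `pack_rgb` and `unpack_rgb` are our own, chosen for illustration, and the bit layout (red in the high byte) is one common convention among several.

```python
BITS_PER_CHANNEL = 8
CHANNELS = 3  # red, green, blue

bits_per_pixel = BITS_PER_CHANNEL * CHANNELS
colors_per_pixel = 2 ** bits_per_pixel


def pack_rgb(red: int, green: int, blue: int) -> int:
    """Pack three 8-bit channel values into one 24-bit integer."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each channel must be in the range 0..255")
    return (red << 16) | (green << 8) | blue


def unpack_rgb(color: int) -> tuple:
    """Recover the (red, green, blue) channel values from a 24-bit integer."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF
```

With these definitions, `colors_per_pixel` evaluates to 16777216, and every one of those colors has a unique numeric name, where natural language has only a handful of color words.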

This is where, again, Generative AI is confined to the limits of written text while the scope of computability is much broader. Take image, for instance. Humans can distinguish perhaps ten million colors. The computer can make millions more, though there is little point in computing beyond human capacities. The capacity of humans to name color properties in natural language is far more limited than the human eye or the computer (Cope and Kalantzis, 2020: 135-53). To the extent that multimodal AI is captive to its written textual labeling, its capabilities to identify colors are less than those of humans and computers.

Computers do a huge amount of mindless calculation to represent an image. They apply the binary notation to name a particular color for a single pixel. Then they count to locate that pixel in its place among millions of other pixels across the x and y axes. By this process, a digital camera may in a certain sense see, but it doesn’t perceive in anything like a human sense. Its processes of structuring are fundamentally different.
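The counting involved in locating a pixel can be shown in a short sketch. We assume here a row-major layout, in which pixels are stored row by row from the top left; this is a common convention but not the only one, and the function names are hypothetical.

```python
def pixel_index(x: int, y: int, width: int) -> int:
    """Position of pixel (x, y) in a flat, row-major sequence of pixels."""
    return y * width + x


def pixel_position(index: int, width: int) -> tuple:
    """Inverse mapping: recover (x, y) from a flat index."""
    return index % width, index // width
```

For an image 1,920 pixels wide, `pixel_index(5, 2, 1920)` counts past two full rows and five further pixels. The machine reaches the pixel by arithmetic alone, with no sense of what the image depicts.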

Colin McGinn uses the concept “mindsight” to characterize human visual meaning. The meaning of an image is a relation between perception and its conceptualization in one’s mind’s eye (McGinn, 2004). Mindsight works on criterial attributes determined by human interests and their cognitive frames, while seeing encompasses endless noisy detail, mostly unnecessary. Human visual meaning filters the message of the image from the noise of irrelevance and redundancy. This we would characterize as the grammatical process of finding or making structure in visual meaning.

Meanwhile, there are other forms of meaning that are hardly computable at all—touch, taste, and smell, for instance, even though we can structure them with a great deal of finesse in clothing, cooking, and aroma. If Generative AI is to represent these meanings, it is through mediation of written text. But the meanings represented in these media, as we know all too well, are much more finely finessed in our visceral human experience than written textual descriptors will allow. Much of human meaning is beyond the range of computability altogether.

Context

In natural language, intelligible meanings are not just in the words of the interlocutors. They are located in context. Take speech: the meaning of the words is as much in the multimodal relation of speech to body, space, and object. And take two key co-ordinates of context, time and space, whose meanings are frequently circumstantial and frequently unremarked (Kalantzis and Cope, 2020: 63-90). Linguists and semioticians call this deixis, where the meaning of “here” or “tomorrow” is not helpfully in the words but in the situation to which the words might be pointing. The meaning is material and experiential. It is formed through the multimodal transposition across different forms of meaning.

For our analysis of the scope and limits of Generative AI to manage the grammar of context, the limits of text-centered understandings of meaning are a starting point. Spoken words rely on embodied presence to tie their otherwise unintelligible vagueness down to relevant particularities. In time, the meanings of “after,” “before,” “during,” “since,” and “until” are spread between speech, body, and context. A finely calibrated meaning of time must be multimodal. In space, the meanings of “above,” “across,” “alongside,” “among,” “down,” “from,” “in,” “out,” “through,” “to,” and “under” are emptied of meaning if not multimodal, crossing from speech into object and space.

When Generative AI is dealing with transcripts of speech, specific meanings will remain vague because context is lost. In written text, explicit cross-reference to particularities is required before its grammar of context begins to make any particular sense. Tenses also manage time, but with a vagueness that demands multimodal supplement. And even in writing, explicit reference to context might be too many words away to be noticeable across statistical vectors.

However, in an era of pervasive connection to spatio-temporal computing devices—today even attached to our bodies—our multimodal experience of time and space has come to far exceed natural language for its particularity. Computers have standardized space on earth with geocoordinates, and all-but definitively disambiguated their natural language referents with GeoNames. They bring time and place together as events in the iCal standard. Nowadays, our devices stamp time and place incidental to experience—to be processed for dynamic computer maps, fitness trackers, personal calendars, and such like (Kalantzis and Cope, 2020: 63-84).

The measured relations of time to space are available for on-the-fly calculation and visualization with an accuracy of microseconds—for practical purposes far surpassing most human needs, and certainly beyond the thresholds of ordinary human experience. Our mobile map app tells us something much more useful than “near”—it gives the distance an exact number; it gives arrival a time based on current speed; it shows directions and place names to support our wayfinding. This is a multimodal experience of embodied, spatial meaning, directed by a device that shows visually in a map, names in text, and speaks that text. We have come to depend on these multimodal transposition machines, these spatio-temporal prostheses.
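How a device turns "near" into an exact number can be sketched with the haversine formula for the great-circle distance between two geocoordinates. This is a deliberate simplification under stated assumptions: real map applications route along road networks and use live speed data, whereas the sketch below measures a straight line over a spherical Earth. The function names and the constant-speed arrival estimate are our own, for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a standard approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def minutes_to_arrival(distance_km: float, speed_kmh: float) -> float:
    """Travel time in minutes, assuming a constant speed."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return distance_km / speed_kmh * 60
```

One degree of longitude along the equator comes out at roughly 111 km, and a 30 km journey at a steady 60 km/h at 30 minutes. Where speech has only "near" and "soon," the calculation returns a number.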

In such ways, these crucial co-ordinates of context, time and space, speak for themselves and for us, clearly and with minimal ambiguity. And like so many other places in this grammar, time and space are not things separate in themselves; they are a relation in tension which we experience as events that have speed and proximity. The meaning of each is in the tense multimodal movement between the one and the other.

In the case of time and space, digital media make for yet another profound existential shift in our everyday living. Spoken words position self at the center of time and place. An ego-centered reading of multimodality is then required to make proper sense of time (“now” and “then”) and place (“here” and “there”). Multimodal transposition is required to make sense of the words at all.

The grammar of digital meaning is multimodal too, but in a radically different way. It speaks from a material reality measured against absolute datum points, not the circumstantial relativity and subjectivity marked by the syntax of speech. The datum points may be historically arbitrary (years counted since the reported birth of a religious persona, or the zero point for Coordinated Universal Time that runs through post-imperial London), but in the pragmatics of wayfinding and event they are nevertheless meant as absolutes and universals today. The GPS may translate this to “turn left” and it may allow us to recenter the map so the arrow of our looking is pointing up. The machine may still perform this grammatical transposition to assuage our diminished egos, but the underlying spatial grammar is powered by dead reckoning.

This makes for an irony in our age of supposed post-Einsteinian relativity. Through the contextual grounding of time and space by binary computing, the experience of context is for practical purposes universal and absolute (Cope and Kalantzis, 2023d: 11-13).

Kant had regarded those all-important contextual co-ordinates, time and space, as categorical impositions. For him, the elemental categories of meaning were cognitive and languaged. They were mental frames, layered over the world. This seemed to be the case when ego-centered speech dominated human discourse, though that was never quite true because even then the meaning is as much outside the categorizer as within.

In the digital era, Kant’s categorical presupposition is less the case than ever. When every human is marching to the same beat and their place has been definitively identified according to geospatial co-ordinates, at least when it comes to time and space Kant has been proved more wrong than ever. The locus of contextual meaning has moved from cognitive-linguistic centering of the world to the immanent multimodal materiality of geolocation. The maps and their marching orders follow.

Our digital devices can also today incidentally record much of our now so-very grounded spatio-temporal experience of context. Bound to written text, Generative AI misses all of this, tied to already published writings with their intrinsically vague, decontextualized, and ego-centered grammar of context. It misses the potentials of the new multimodal configurations of context in the digital era. Tethered as it is to written text, Generative AI fails to use much that is computable, and of course everything that is not.

This is a move of enormous semiotic and historical significance, away from cultures relying on personally-framed natural language. We are moving towards a society of pervasive cyber-social meaning (Cope and Kalantzis, 2023d), dependent in so many domains on objectified, immanently material, multimodal meanings operationalized through the technology of binary notation. My face and the barcode on my passport have become more reliable proper nouns for me (functionally: reference) than the name I can speak. The “right” and “left” on my phone (functionally: context) can be trusted more than my own judgment of place because, when my phone speaks, it is from a space > speech transposition whose source geocoordinates are absolute and fixed. The center of gravity of cyber-social meanings has moved away from language and into my embodied life and material existence.

Félix Guattari has some nice turns of phrase for these underlying semiotic processes. They are, he says, a “machinic unconscious” powered by “asignifying semiotics” (Guattari, 1979 [2001]).

Interest

Meanings are motivated. They are functional. They do things. In the vernacular, we call this “making sense.” Sense does not just exist. It is made and the roots of this making are what we term interest, as would Habermas (Habermas, 1968 [1971]).

Interest is a kind of gestalt, a meaning that only springs to mind as one squints at a whole. Interest might be manifest as worldview, ideological frame, paradigm, way of thinking, form of life, persona, or identity positionality. Machines can’t have interests, though of course their human designers and users do. Humans vicariously infuse their machines with interest.

If machines are themselves disinterested, how can they help identify interests in human meaning? In natural language processing, sentiment analysis picks up words indicating emotion and judgment. Emoticons are included in the Unicode character set, capturing a wide range of sentiments that roughly maps to facial expressions and gestures. AI is captive to the vagaries of these second-order representations. On a scale of one to five, a survey may ask, are you very unsatisfied, unsatisfied, partly satisfied, satisfied, or very satisfied? One point on this person’s scale of satisfaction may be another person’s point of dissatisfaction. Interests can be as varied as humans and their situations.

It is possible to name interests by the macro-frames of politics, religion, worldview, paradigm, domain of expertise, or any number of other such labels. But the boundaries of adherence, the internal variety, and the intersection of different interests are more than textual labels can handle adequately. Interests also complement and conflict; they are not things in themselves but dialectical relations between interests—another vector of transposition.

Interests, moreover, are frequently obscured. Realities may be so embedded in the seemingly inexorable materialities of the world that they are in effect hidden in plain sight. This is how the commonsense of the lifeworld obscures deeper and broader interplays of interest. Oftentimes the frames of interest themselves occlude: they blind us to realities which may be obvious from the perspective of another frame of interest (Kalantzis and Cope, 2020: 189-335). Generative AI has no way of analyzing whether something written is a mere figment of ideology.

To apply our grammatical metaphor, interests can only make a more complete sense when they are parsed. It is hard for humans to unpack the underlying designs of interest. But computers can only offer mediated and superficial support, even with their most sophisticated sentiment analyses.

Generative AI is nevertheless unavoidably interested in interest, though it buries its metainterest in the pretense of neutrality. In the digital record of human experience there are all sorts of interests that have nowadays become unacceptable to liberal sensibilities: sexism, racism, homophobia, and a list of other breaches of newfound social decency. There are also to be found in the historical textual record dangerous recipes for hatred, suicide, violence, terrorism, and more.

To be an acceptable project nowadays, these parts of the record of human experience need to be blocked from view. Generative AI achieves this with filters. These have to be applied by hand through processes of supervised machine learning. This is how Generative AI looks after our morals. Among other things, it expunges from the record of the human experience naughty words and female nipples. Clever users can nevertheless manipulate the AI to bring back to life the range of ideologies represented in the textual sources—these are called “jailbreaks” (Shen et al., 2023).

The larger question is, what is the frame of interest that has been handcrafted into Generative AI? What precisely is the extent of its distortion of the texts of history as it edits out interests other than its own? How and why does it bury the interest of liberal capitalism in a monotone, unself-reflective politeness?

We’re not suggesting, of course, that Generative AI should be allowed to spew hateful and dangerous stuff, which it certainly would if its technology were not deliberately restrained. Ours is a more modest suggestion: to be explicit about its frame of interest rather than to feign disinterest. Generative AI should also be able to give voice to a wider range of acceptable interests. Will it filter out humanistic critics of the military-industrial complex of which the AI itself is rapidly becoming a cornerstone (Cope and Kalantzis, 2022a)?

When we parse Generative AI’s management of interest, we find its roots are firmly in the past because all it has for its source are legacy texts. It will not be able to handle genuinely future-oriented thinking. If we think AI might create a world of fully automated luxury communism (Bastani, 2019), a world with high technology but zero inequality, Generative AI could never help us get to that, though these kinds of aspirational stories are to be found in the historical record, even if, as often as not, they have come to unhappy endings.

Chomsky’s politics now looms larger as his intellectual legacy than his linguistics, but this does not find a happy place in Generative AI either. He writes, “In the absence of a capacity to reason from moral principles, ChatGPT was crudely restricted by its programmers from contributing anything novel to controversial—that is, important—discussions. It sacrificed creativity for a kind of amorality” (Chomsky et al., 2023).

Applications: Supplementing and recalibrating generative artificial intelligence

In the first two parts of this paper, we have attempted to analyze the scope and limits of Generative AI against the measure of a multimodal grammar. Confined as it is to the statistical analysis of written text, there is much more to meaning than Generative AI can process. Much more is computable. Even more is not. What now are the implications? How might this grammatical analysis of machine-mediated meaning be put to work? We’ll suggest two levels of action: at the level of pedagogical practice, and at the level of the underlying code.

Expanding pedagogical practices

At the pedagogical level it is not possible to rely on Generative AI for much that we need to do in education. Education is characterized by certain “epistemic virtues,” to apply a phrase coined by Fazal Rizvi (Rizvi, 2015: 10). One such virtue is to ground understanding in fact, an empirical process supported by the grammatical processes of instantiation, the specification of properties and verification through multimodal transposition. Another is to develop and apply theoretical or disciplinary frameworks, a grammatical process that we have termed conceptualization. Still another is to account for the patterns of agency that have been transposed into referenced things; another to see patterns in structures of meaning that are semantic rather than statistical; another is to find meaning distributed across and between words and their contexts; and another is critical analysis, or the interrogation of interests.

In this paper we have shown that Generative AI is not terribly good at any of these things. Nevertheless, once we have parsed its limitations, there is much that can be done with Generative AI in education.

This is the lesson for learners and teachers: don’t ask it to do more than it can. Give it facts, but don’t ask for them. Give it disciplinary concepts rigorously defined elsewhere if you are going to ask it to work with them. Supply specific context when needed. Don’t ask for or expect it to express interests. But Generative AI is very good at ordering words or narrativizing. Parsing its grammatical affordances in these ways can point our attention to the best uses for Generative AI.

In our research and development work, we have been building a Generative AI web application for educational use, CGMap (Common Ground Map). Located within our CGScholar platform (Cope and Kalantzis, 2023b), CGMap can connect to publicly accessible Large Language Models via API (Application Programming Interface). So far we have for practical reasons connected to OpenAI’s series of GPTs, but the software could connect to others when access to them becomes more convenient.

CGMap recalibrates the Large Language Model (LLM) via four mechanisms:

1. Prompt engineering - the requests with which the chatbot queries the LLM

2. Fine tuning - supplementing the LLM with trusted text

3. Ontology supplements - bringing into the machine analysis human-crafted discipline and domain understandings as expressed in data models and interoperability standards, and

4. Human moderation - making the interactions with the LLM transparent to users, and requiring human moderation of every AI response.

We would argue that these are essential software adjustments if Generative AI is set to work in education settings.

Our approach to prompt engineering focuses on high-level disciplinary narratives as represented in academic literacies. The prompts focus on the form of the learners’ knowledge exposition. If Generative AI does anything well, it is the mechanization of genre (Kalantzis and Cope, 2020: 181-83). It analyzes students’ multimodal writing most constructively when the prompts are explicit about the most appropriate form of the text. In CGMap, the AI will run through a piece of student work in multiple passes, each pass focusing on a key aspect of that genre drawn from a rubric. It offers feedback and suggestions on that aspect. It allows students to run the review multiple times to elicit variations of the response. They can query the AI response. One of the keys to effective prompt engineering is to provide verbose prompts, more ponderously explicit than conversation with a human ever needs to be. We describe the results of our initial trials elsewhere (Tzirides et al., 2023). Further reporting is currently in press or underway.
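
The multi-pass review described above might be sketched as follows. This is a minimal sketch under our own assumptions and not the CGMap code: the rubric entries, the prompt wording, and the names `build_prompt`, `review`, and `fake_llm` are hypothetical, and `fake_llm` merely stands in for a call to a Large Language Model so that the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical rubric: criterion name -> explicit description (assumed text).
RUBRIC: Dict[str, str] = {
    "thesis": "The text states a clear central claim in its opening paragraph.",
    "evidence": "Each claim is supported by cited empirical evidence.",
    "structure": "Sections follow the conventions of an academic report.",
}


def build_prompt(genre: str, criterion: str, description: str, work: str) -> str:
    """Compose one verbose, explicit prompt for a single rubric criterion."""
    return (
        f"You are reviewing a student text written in the genre: {genre}.\n"
        f"Focus only on this criterion: {criterion}.\n"
        f"Criterion description: {description}\n"
        "Offer feedback and concrete suggestions on this criterion alone.\n"
        "Do not supply facts that are not already in the student text.\n\n"
        f"Student text:\n{work}"
    )


def review(work: str, genre: str, ask_llm: Callable[[str], str]) -> List[dict]:
    """Run one pass per rubric criterion and collect the responses."""
    results = []
    for criterion, description in RUBRIC.items():
        prompt = build_prompt(genre, criterion, description, work)
        results.append({"criterion": criterion, "feedback": ask_llm(prompt)})
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"(model response to a prompt of {len(prompt)} characters)"


passes = review("Plants need light to grow.", "science report", fake_llm)
for item in passes:
    print(item["criterion"], "->", item["feedback"])
```

Keeping one criterion per pass is what allows each prompt to be as ponderously explicit as it needs to be, and re-running `review` corresponds to eliciting a fresh variation of the response.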

Another area of recalibration in CGMap is fine tuning (Lv et al., 2023). This process provides supplementary text for the LLM deemed by the teacher to have educational value. These texts are then prioritized ahead of those in the generic, internet-wide LLM. We’ve done this to remarkable effect with tens of millions of words written by our graduate students combined with our own books and writings on semiotics and education. This reduces the LLM’s reliance on the unholy mix of insight and junk it has found on the internet.

The third area of our activity has been to supplement the LLM with carefully structured ontologies. We have tested this in the domain of medical education. Here, the ontologies are extensive and rigorous for the best reasons (clear documentation of medical cases for the purposes of medical communication) as well as the worst (the money game of medical insurance claims). The AI can be all-the-more powerful when ontologies are brought into play.

A current project has medical students tying their clinical thinking into rigorous, ontology-connected knowledge graphs. Then, having also provided the facts of the case, they use Generative AI to help them write a narrative in the form of a doctor’s clinical notes. The Generative AI should never be asked to provide the facts of the case—the user or the medical informatics system must do this. Nor can it apply the rigorous disciplinary frame of the ontology—we add this in our ontology mapping application. But it can help offer feedback in support of the development of a well-formed clinical narrative to support an educational and medical objective that we would characterize as “critical clinical thinking” (Cope et al., 2022).
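
The division of labor in this workflow can be illustrated with a small sketch: the user supplies the facts, an ontology constrains the terms, and only then is a prompt for narrative support composed. Everything here is an assumption for the purposes of illustration, including the miniature vocabulary in `CONCEPTS` and `RELATIONS` and the function names; a real medical ontology is vastly larger and formally coded.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical mini-ontology: permitted concept labels and relations.
CONCEPTS: Set[str] = {"patient", "fever", "cough", "pneumonia", "chest x-ray"}
RELATIONS: Set[str] = {"presents_with", "suggests", "confirmed_by"}


def invalid_triples(graph: List[Triple]) -> List[Triple]:
    """Return the triples that use terms outside the ontology."""
    return [
        t for t in graph
        if t.subject not in CONCEPTS
        or t.relation not in RELATIONS
        or t.obj not in CONCEPTS
    ]


def narrative_prompt(graph: List[Triple]) -> str:
    """Turn validated, user-supplied facts into a prompt for narrative feedback."""
    problems = invalid_triples(graph)
    if problems:
        raise ValueError(f"Terms outside the ontology: {problems}")
    facts = "\n".join(f"- {t.subject} {t.relation} {t.obj}" for t in graph)
    return (
        "Using only the facts listed below, help shape a well-formed "
        "clinical note. Do not add facts.\n" + facts
    )


student_graph = [
    Triple("patient", "presents_with", "fever"),
    Triple("patient", "presents_with", "cough"),
    Triple("fever", "suggests", "pneumonia"),
    Triple("pneumonia", "confirmed_by", "chest x-ray"),
]
print(narrative_prompt(student_graph))
```

The validation step happens before the model is involved at all, which reflects the point that the disciplinary frame is applied outside the Generative AI.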

A final major area of our software development has been essential human moderation of the AI. Students are provided with the same rubric as the AI, even if its text is inelegant for its pedantic explicitness and reiterations in synonymous phrasing. We don’t ever allow the AI to pass judgement or make suggestions without human checks on the same measure—by peers, self and/or instructor. The AI may be a collaborator, but it is not an oracle. We ask our students to account for the differences between these kinds of feedback. We also require students to declare the ways in which they used AI and human feedback in the development of their work. The changes are recorded for our learning analytics through version control.
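
One way to express this rule in software is a gate that releases AI feedback only where a human has reviewed the same rubric criterion, together with a record of what changed between drafts. The sketch below is hypothetical throughout (the class names, the source labels, and the use of Python's standard `difflib` module are our assumptions) and is not the CGMap implementation.

```python
import difflib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feedback:
    source: str      # "ai", "peer", "self", or "instructor"
    criterion: str   # the rubric criterion this feedback addresses
    text: str


@dataclass
class ReviewRound:
    items: List[Feedback] = field(default_factory=list)

    def releasable_ai_feedback(self) -> List[Feedback]:
        """Release AI feedback only where a human reviewed the same criterion."""
        human_criteria = {f.criterion for f in self.items if f.source != "ai"}
        return [
            f for f in self.items
            if f.source == "ai" and f.criterion in human_criteria
        ]


def record_changes(before: str, after: str) -> List[str]:
    """Return a unified diff between two versions of a student's text."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="draft_1", tofile="draft_2", lineterm="",
    ))


round_one = ReviewRound([
    Feedback("ai", "evidence", "Cite a source for the second claim."),
    Feedback("peer", "evidence", "The second claim needs support."),
    Feedback("ai", "structure", "Add a concluding section."),
])
for fb in round_one.releasable_ai_feedback():
    print(fb.criterion, ":", fb.text)
print("\n".join(record_changes("Plants grow.", "Plants grow with light.")))
```

In this example the AI comment on structure is withheld because no peer, self, or instructor review yet addresses that criterion.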

These, then, are our recommendations for the application of Generative AI in pedagogy. At the most general level, this is how we might parse its designs and on this basis supplement them:

• Reference - supplementing Generative AI with the epistemic virtues it lacks: empirical triangulation to ensure the factual basis of instances; conceptual work of theoretical and disciplinary frameworks; and the interrogation of both instances and concepts according to their criterial properties

• Agency - beyond the passivity and stability of objects and states, tracking the patterns in the human and natural agencies where everything is always in process

• Structure - realizing the affordances and analyzing the limits of media and tracing the multimodal transpositions, how meanings metamorphose as they cross and combine different forms of meaning

• Context - making holistic meaning, recognizing that meaning is more than the text to which Generative AI is confined and that, even in the case of text, meaning is distributed between written text and its contexts, and

• Interest - critical interrogation of AI source corpora and the companies that apply filters; fine tuning the AI with text judged more truthful to a domain; and human moderation.

Expanding the scope of the code

Beyond this pedagogical focus, there are also implications for the underlying code. Generative AI has ridden a wave of statistical euphoria. Not only have most hitherto published words now been digitally recorded, but the machines and the statistics are able to process these across relevant vectors. This, of course, is a very powerful development. But beyond this, the AI can only become more powerful if it can leverage the potentials for computability that are out of reach for the “bag of words.” The LLM contains an inhuman number of words, multiplied by preprocessed information about their ordering. That’s what makes the technology work, and why it can prove helpful. It’s not unlike the difference between a library and the mind. Libraries support minds precisely for their difference. But this doesn’t mean we can use libraries as a substitute for minds.

Because Generative AI relies on a library of digitized texts, its scope is also confined to the documented past. For this reason, AI is unable to transcend its past, even after the unacceptable aspects of that past in the historical record have been suppressed. However large the language models, corpora are closed by the past while the possibilities for future human meaning are open. Every design emerging from Generative AI may be new in the sense that its properly formed words have never been generated before. However, it can't be innovative in the sense of creating new vectors of meaning. To this extent, it will tend always to average its responses to norms of the past. We should hope for something better.

Something better demands the addition of a peculiarly human “grammatical attention”—much more than statistical attention. This can be added to the AI with formal, domain-specific ontologies—the focused attention medicine applies to our ailing bodies, for instance. Beyond that, we are making the case for a metaontology of meanings along the lines we have developed and applied in our transpositional grammar (Cope et al., 2011; Cope and Kalantzis, 2020: 322-28).

At this point, the grammatical project becomes a software project. In an integrated program we need to complement statistical with grammatical approaches—or in computer science terms, the data modelling for which Judea Pearl pleads (Pearl, 2018).

• Reference: complementing Generative AI with data models, ontologies, tagging schemas, and database architectures

• Agency: supplementing reference to objects and states by explicitly calling out actions and flows

• Structure: critical refinement of tagging schemes, particularly for object, space and body as represented in digital media

• Context: supervised and unsupervised supplementation of digital media with contextual co-ordinates, and

• Interest: developing and applying rigorous and adaptable ontologies of positionality and ideology.
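
As a first approximation, the five functions listed above could be carried in a single record alongside whatever a statistical model produces. The data model below is a minimal sketch and no more: the field names, the example values, and the helper `missing_functions` are hypothetical, and a working system would draw each field from a formal ontology or tagging schema rather than free text.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional

# Forms of meaning named in this paper.
FORMS = {"text", "image", "space", "object", "body", "sound", "speech"}


@dataclass
class Annotation:
    """One unit of meaning tagged with the five functions."""
    form: str                                   # one of FORMS
    reference: str                              # what is referred to
    agency: Optional[str] = None                # action or flow, if called out
    structure: List[str] = field(default_factory=list)       # schema tags
    context: Dict[str, float] = field(default_factory=dict)  # e.g. coordinates
    interest: Optional[str] = None              # declared positionality, if known


def missing_functions(a: Annotation) -> List[str]:
    """List the functions that a given annotation leaves unparsed."""
    gaps = []
    if a.form not in FORMS:
        gaps.append("form")
    if a.agency is None:
        gaps.append("agency")
    if not a.structure:
        gaps.append("structure")
    if not a.context:
        gaps.append("context")
    if a.interest is None:
        gaps.append("interest")
    return gaps


# A bare text label of the kind a captioning model might supply.
caption_only = Annotation(form="image", reference="passport photograph")
print(missing_functions(caption_only))

# The same item with the other functions made explicit (assumed values).
fuller = Annotation(
    form="image",
    reference="passport photograph",
    agency="border officer verifies traveler",
    structure=["face", "document"],
    context={"lat": 40.11, "lon": -88.24},
    interest="state identification",
)
print(asdict(fuller)["context"])
```

The point of the exercise is the list of gaps: a text label alone fills only the reference slot, and the remaining functions have to be supplied by other means.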

Even then, there will be profoundly human meanings that are not computable and future meanings that must be imagined and realized in ways that only humans can do.

To parse the world and to change the world

The world is endlessly complex, and what’s fundamentally different between the statistical attention that drives Generative AI and the grammatical attention of humans are their compression mechanisms. Chomsky is right about this, even if we would disagree with him about the language-centric and the brain-bound biophysical substrate that he identifies as the basis for elemental patterns in human meaning (Cope and Kalantzis, 2020: 288-300). Contrary to Chomsky, we would argue that the compression mechanisms are historically evolved and located in the transpositions across and between the mental and material worlds (Lim et al., 2022): in text, image, space, object, body, sound, and speech. Parsing the functions of meaning (reference, agency, structure, context and interest), we have attempted to capture in broad outline a metaontology of embodied, material human experience.

To know the world is to trace the patterning in its otherwise bewildering cacophony of meanings. We need ordering principles and practices with which we make sense of endless complexity. This is where the human history of mind meets the natural history of the brain. It is how learning occurs.

On the subject of learning, “machine learning” is one of the much-vaunted features of AI. It posits the possibility that binary computing machines could think in human-like ways. The advocates of Generative AI boldly promise that, with its self-taught reinforcement learning, we are on the way to “artificial general intelligence.” We think “preprocessed statistical analysis of text recorded in binary notation” would be a more accurate way to describe the mechanism by which word-to-word vector weightings are retrieved in Generative AI. It’s just counting: mindless empiricism, even if on an awesomely industrial scale. The narrowly statistical attention of Generative AI does no more than measure quantifiable regularity of the appearance of this empirical token with that. Empirical work on this scale is realistic for a machine, but unrealistic for a human.
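
The principle of counting the regularity with which one token follows another can be shown at toy scale. The sketch below is a deliberate simplification: contemporary LLMs learn weighted vectors in neural networks rather than storing a table of raw counts, and the miniature corpus and function names here are our own assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict

corpus = "the cat sat on the mat . the cat ate ."  # toy corpus (assumed)


def bigram_counts(text: str) -> Dict[str, Counter]:
    """Count how often each token is followed by each other token."""
    tokens = text.split()
    counts: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def next_token_probabilities(counts: Dict[str, Counter], token: str) -> Dict[str, float]:
    """Convert raw follow-on counts for one token into relative frequencies."""
    total = sum(counts[token].values())
    return {nxt: n / total for nxt, n in counts[token].items()}


counts = bigram_counts(corpus)
print(next_token_probabilities(counts, "the"))
```

Nothing in this procedure knows what a cat or a mat is; it registers only that one string of characters tends to follow another.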

Grammatical attention, by contrast, demands a meaning-of-meaning reflexivity. Human learning must always be grammatical and thus theoretical in temper if it is to tame real world complexity. It is the development of capacities to theorize along the lines outlined by Vygotsky that makes human learning different from the painful and (thankfully) impossible-for-humans empiricism of LLMs. Machine learning is also not learning in anything like the human sense because that would require some degree of self-understanding of generalizable principles (Cope and Kalantzis, 2023c).

This is the foundation of the pedagogical argument we would make for grammar in our broad sense of that term. Schooling itself requires compression mechanisms, where learners learn the general shape of the world in optimal time and in order to take this generalizable knowledge out to the otherwise far-too-complex world of application.

And after that, we need to imagine radically different and qualitatively better human possibilities. Our duty is not merely to reiterate the world, but to change it.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author biographies

Bill Cope is a Professor at the University of Illinois.

Mary Kalantzis is a Professor at the University of Illinois.

References

Ashby (1956) An Introduction to Cybernetics. London: Chapman & Hall.

Baker and Kanade (2000) Hallucinating faces. In: Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 28–30 March 2000. DOI: 10.1109/AFGR.2000.840616.

Bastani (2019) Fully Automated Luxury Communism: A Manifesto. London: Verso.

Bender, Gebru, McMillan-Major, et al. (2021) On the dangers of stochastic parrots: can language models be too big? In: FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Canada, March 3–10, 2021, pp. 610–623.

Bergson (1903) An Introduction to Metaphysics. Translated by T.E. Hulme. New York NY: G.P. Putnam.

Bommasani, et al. (2022) On the Opportunities and Risks of Foundation Models. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2108.07258.

Brown, Cocke, Della Pietra, et al. (1990) A statistical approach to machine translation. Computational Linguistics 16(2): 79–85.

Chomsky (1957) Syntactic Structures. Amsterdam NL: de Gruyter Mouton.

Chomsky, Roberts and Watumull (2023) The false promise of ChatGPT. New York: New York Times.

Church and Mercer (1993) Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics 19(1): 1–24.

Cope and Kalantzis (2020) Making Sense: Reference, Agency and Structure in a Grammar of Multimodal Meaning. Cambridge UK: Cambridge University Press. DOI: 10.1017/9781316459645.

Cope and Kalantzis (2022a) Artificial intelligence in the long view: from mechanical intelligence to cyber-social systems. Discover Artificial Intelligence 2(13): 1–18. DOI: 10.1007/s44163-022-00029-1.

Cope and Kalantzis (2022b) The cybernetics of learning. Educational Philosophy and Theory 54: 1–37. DOI: 10.1080/00131857.2022.2033213.

Cope and Kalantzis (2023a) Towards education justice: multiliteracies revisited. In: Zapata, Kalantzis and Cope (eds) Multiliteracies in International Educational Contexts: Towards Education Justice. London: Routledge.

Cope and Kalantzis (2023b) Creating a different kind of learning management system: the CGScholar experiment. In: Montebello (ed) Promoting Next-Generation Learning Environments through CGScholar. Hershey PA: IGI Global, pp. 1–18. DOI: 10.4018/978-1-6684-5124-3.ch001.

Cope and Kalantzis (2023c) On cyber-social learning: a critique of artificial intelligence in education. In: Kourkoulou, Tzirides, Cope, et al. (eds) Trust and Inclusion in AI-Mediated Education: Where Human Learning Meets Learning Machines. Cham: Springer.

Cope and Kalantzis (2023d) On cyber-social meaning: the clause, revised. The International Journal of Communication and Linguistic Studies 21(2): 1–18. DOI: 10.18848/2327-7882/CGP/v21i02/1-18.

Cope, Kalantzis and Magee (2011) Towards a Semantic Web: Connecting Knowledge in Academic Research. Cambridge UK: Elsevier.

Cope, Kalantzis, Zhai, et al. (2022) Maps of medical reason: applying knowledge graphs and artificial intelligence in medical education and practice. In: Peters, Jandrić and Hayes (eds) Bioinformational Philosophy and Postdigital Knowledge Ecologies. Cham: Springer, pp. 133–159. DOI: 10.1007/978-3-030-95006-4_8.

de Saussure (1916) Course in General Linguistics. Translated by Roy Harris. London UK: Duckworth.

Deng, Dong, Socher, et al. (2009) ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 20–25 June 2009. DOI: 10.1109/CVPR.2009.5206848.

Fillmore (1968) The case for case. In: Bach and Harms (eds) Universals in Linguistic Theory. New York NY: Holt, Rinehart and Winston, pp. 1–88.

Fillmore (1976) Frame semantics and the nature of language. Annals of the New York Academy of Sciences 280(1): 20–32.

Frankfurt (2005) On Bullshit. Princeton, NJ: Princeton University Press.

Gee (2023) GPT-4 and Thematic Meanings. Available at: https://www.protohumancurriculum.com/insights-circle/gpt4-and-thematic-meaning.

Guattari (1979) The Machinic Unconscious: Essays in Schizoanalysis. Translated by Taylor Adkins. Los Angeles CA: Semiotext(e).

Habermas (1968) Knowledge and Human Interests. Translated by Jeremy J. Shapiro. Boston MA: Beacon Press.

Halliday MAK (1968) Notes on transitivity and theme in English: Part 3. Journal of Linguistics 4(2): 179–215.

Halliday MAK (2000) Grammar and daily life: concurrence and complementarity. In: Webster (ed) On Grammar: The Collected Works of M.A.K. Halliday. London UK: Continuum, Vol. 1, pp. 369–383.

Halliday MAK and Hasan (1985) Language, Context, and Text: Aspects of Language in a Social-Semiotic Perspective. Geelong AU: Deakin University Press.

Halliday MAK and Matthiessen CMIM (2014) Halliday’s Introduction to Functional Grammar. 4th edition. Milton Park UK: Routledge.

Harris (1951) Structural Linguistics. Chicago IL: University of Chicago Press.

Harris (1954) Distributional structure. Word 10(2–3): 146–162. DOI: 10.1080/00437956.1954.11659520.

Hasan (1999) Speaking with reference to context. In: Ghadessy (ed) Text and Context in Functional Linguistics. Amsterdam NL: John Benjamins, pp. 219–328.

Jelinek (2005) Some of my best friends are linguists. Language Resources and Evaluation 39(1): 25–34. DOI: 10.1007/s10579-005-2693-4.

Jewitt (ed) (2009) The Routledge Handbook of Multimodal Analysis. London UK: Routledge.

Kalantzis and Cope (2020) Adding Sense: Context and Interest in a Grammar of Multimodal Meaning. Cambridge UK: Cambridge University Press. DOI: 10.1017/9781108862059.

Kalantzis and Cope (2022) After language: a grammar of multiform transposition. In: Lütge (ed) Foreign Language Learning in the Digital Age: Theory and Pedagogy for Developing Literacies. London: Routledge, pp. 34–64. DOI: 10.4324/9781003032083-4.

Kalantzis and Cope (2023) Multiliteracies: life of an idea. The International Journal of Literacies 30(2): 17–89. DOI: 10.18848/2327-0136/CGP/v30i02/17-89.

Kalantzis, Cope, Chan, et al. (2012) Literacies. Cambridge, UK: Cambridge University Press.

Klein (2023) AI Machines Aren’t ‘Hallucinating,’ but Their Makers Are. London, UK: The Guardian. Available at: https://www.theguardian.com/commentisfree/2023/may/08/ai-machines-hallucinating-naomi-klein

Knight (2023) OpenAI’s CEO Says the Age of Giant AI Models Is Already over. San Francisco, CA: Wired. Available at: https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/

Kress (2000) Design and transformation: new theories of meaning. In: Cope and Kalantzis (eds) Multiliteracies: Literacy Learning and the Design of Social Futures. London: Routledge, pp. 153–161.

Kress (2009) Multimodality: A Social Semiotic Approach to Contemporary Communication. London: Routledge.

Kress and van Leeuwen (1996) Reading Images: The Grammar of Visual Design. London: Routledge.

Landauer, McNamara, et al. (eds) (2007) Handbook of Latent Semantic Analysis. New York NY: Routledge.

Lehrer (1975) Talking about wine. Language 51(4): 901–923. DOI: 10.2307/412700.

Li, Zhang, Chen, et al. (2023) MIMIC-IT: Multi-Modal In-Context Instruction Tuning. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2306.05425.

Lim, Cope and Kalantzis (2022) A metalanguage for learning: rebalancing the cognitive with the socio-material. Frontiers in Communication 7: 1–15. (Article 830613) DOI: 10.3389/fcomm.2022.830613.

Liu, Yao, Zhang, et al. (2023) BOLAA: Benchmarking and Orchestrating LLM-Augmented Autonomous Agents. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2308.05960.

Lv, Yang, Liu, et al. (2023) Full Parameter Fine-tuning for Large Language Models with Limited Resources. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2306.09782.

Manning, Clark, Hewitt, et al. (2020) Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences 117(48): 30046–30054. DOI: 10.1073/pnas.1907367117.

McGinn (2004) Mindsight: Image, Dream, Meaning. Cambridge MA: Harvard University Press.

Mider (2016) What Kind of Man Spends Millions to Elect Ted Cruz? New York, NY: Bloomberg News. Available at: https://www.bloomberg.com/politics/features/2016-01-20/what-kind-of-man-spends-millions-to-elect-ted-cruz-?leadSource=uverifywall

New London Group (1996) A pedagogy of multiliteracies: designing social futures. Harvard Educational Review 66(1): 60–92. DOI: 10.17763/haer.66.1.17370n67v22j160u.

OpenAI (2023) GPT-4 Technical Report. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2303.08774.

Pearl (2018) Theoretical Impediments to Machine Learning with Seven Sparks from the Causal Revolution. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.1801.04016.

Piantadosi (2023) Modern Language Models Refute Chomsky’s Approach to Language. Lingbuzz Preprint. Available at: https://lingbuzz.net/lingbuzz/007180

Richens and Halliday MAK (1957) Word decomposition for machine translation. In: Annual Round Table Meeting on Linguistics and Language Studies. Washington DC: Georgetown University.

Rizvi (2015) Internationalization of curriculum: a critical perspective. In: Hayden, Levy and Thomson (eds) Handbook of International Education. London: Sage, pp. 337–350. DOI: 10.4135/9781473943506.n23.

Rumelhart, Hinton and Williams (1986) Learning representations by back-propagating errors. Nature 323: 533–536.

Shannon (1938) A symbolic analysis of relay and switching circuits. Transactions American Institute of Electrical Engineers 57: 471–495.

Shen, Chen, Backes, et al. (2023) “Do Anything Now”: Characterizing and Evaluating in-the-Wild Jailbreak Prompts on Large Language Models. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2308.03825.

Tzirides, Zapata, Saini, et al. (2023) Generative AI: Implications and Applications for Education. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2305.07605.

Vaswani, Shazeer, Parmar, et al. (2017) Attention Is All You Need. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.1706.03762.

Vygotsky (1978) Mind in Society: The Development of Higher Psychological Processes. Cambridge, MA: Harvard University Press.

Weizenbaum (1966) ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1): 36–45.

Whitehead (1928) Process and Reality. New York NY: Free Press.

Wittgenstein (1922) Tractatus Logico-Philosophicus. London UK: Routledge.

Xu, Ba, Kiros, et al. (2015) Show, Attend and Tell: Neural Image Caption Generation With Visual Attention. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.1502.03044.

Zhai and Massung (2016) Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Williston VT: ACM and Morgan & Claypool.

Zhang, Gong, Zhang, et al. (2023) Meta-Transformer: A Unified Framework for Multimodal Learning. Ithaca, NY: arXiv. DOI: 10.48550/arXiv.2307.10802.