Abstract
Ambridge calls for exemplar-based accounts of language acquisition. Do modern neural networks such as transformers or word2vec – which have been extremely successful in modern natural language processing (NLP) applications – count? Although these models often have ample parametric complexity to store exemplars from their training data, they also go far beyond simple storage by processing and compressing the input via their architectural constraints. The resulting representations have been shown to encode emergent abstractions. If these models are exemplar-based, then Ambridge’s theory only weakly constrains future work. On the other hand, if these systems are not exemplar models, why is it that true exemplar models are not contenders in modern NLP?
Ambridge (2020) calls for language acquisition researchers to take seriously the possibility that speakers do not possess any linguistic abstractions, and rely only on stored exemplars and fast analogical reasoning for comprehension and production. Part of Ambridge’s argument against abstraction and in favor of exemplar models is the success of exemplar-based computational models in capturing empirical phenomena. Although the article cites exemplar-based models covering a range of phenomena, we were surprised to find a gap in the survey. In the last decade, there has been enormous progress in natural language processing (NLP) on a wide variety of tasks from speech recognition and language modeling all the way to question-answering and inference. The models that have enabled this progress are all variants of multi-layer neural networks, including neural word embedding models (Mikolov et al., 2013) and transformer-based models like BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019). These models are trained on very large data sets and have a very large number of parameters, allowing them to store and recode substantial summaries of their training data.
Under Ambridge’s definition, are modern deep neural models of language examples of exemplar models? The answer to that question is not straightforward. One issue is that the definition of exemplar models on offer does not allow readers to make that judgment. Additionally, there are open questions as to how fully modern language models encode the input they receive and the extent to which they retain it in their fitted parameters.
The current state of the art on many NLP tasks is achieved through transformer-based models, like BERT. In brief, a transformer model encodes a sequence (e.g., the words in an English sentence) into high-dimensional vectors as it is passed through a series of layers, before passing that encoded input through a further series of decoding layers that have been trained for a given task (e.g., translation or part-of-speech tagging). Transformers, which share features such as many layers and self-attention with other modern language models, have proven capable of learning both nearby and long-distance dependencies. Yet these seemingly abstract rules are learned only incrementally, as the connection weights gradually change during training on a large number of exemplars. Is a transformer an exemplar model? Because of the huge number of parameters in these models and the complex ways in which they interact as information is passed through the layers, it is difficult to peer into the model and understand why it does what it does, what representations it learns, and how much information it stores. But using such models as a case study can illuminate aspects of Ambridge’s argument.
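To make the notion of self-attention mentioned above concrete, the following is a minimal sketch of single-head scaled dot-product self-attention. It is an illustration under simplifying assumptions, not the implementation used in BERT or any other cited model: the learned query, key, and value projections are replaced by the identity, there is one head and one layer, and the two-dimensional "word vectors" in `toy_sentence` are hypothetical values chosen by hand.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the input vectors themselves
    (identity projections). Real transformers learn separate projections.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # The new representation is a weighted average of all input vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors))
               for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy input: three "words", each a 2-dimensional vector.
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attention = self_attention(toy_sentence)
```

Each output vector is a mixture of every input vector, weighted by similarity, so no position's representation is a verbatim copy of a single input token. This is one reason it is hard to say how much of the input such a model stores as opposed to recodes.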
On the one hand, if a modern language model is an exemplar model under Ambridge’s definition, then what it means to be an exemplar model may be vacuous, since these models appear to do some of the things that Ambridge considers outside the scope of exemplar models, such as representing abstract structures. Whereas earlier NLP systems often made use of explicit, symbolic representations like probabilistic context-free grammars, neural NLP models are not explicitly designed to compute over abstract linguistic structures. So the way they store and manipulate abstract structures is more opaque. But that opacity does not mean they are free of abstraction.
In fact, there has been ample work showing that neural models capture aspects of abstract linguistic structure. For instance, neural networks capture syntactic generalizations necessary for long-distance agreement (Gulordava et al., 2018). In some cases, the mechanism by which neural networks make such generalizations about long-distance agreement is well understood. A study of one particular architecture, a long short-term memory (LSTM) neural network, showed that there are nodes in the network that respond selectively to abstract syntactic categories, such as subjecthood and number (Lakretz et al., 2019). There are also a number of linguistic structures that can be recovered from neural networks. A probing technique has shown that deep neural models of language encode syntactic tree distance (Hewitt & Manning, 2019). Investigations of BERT show that some parameters within the model appear to selectively identify syntactically relevant abstract categories like determiners for nouns and direct objects for verbs (Clark et al., 2019). Artificial neural networks can even be fruitfully analyzed using a technique that is often treated as a paradigmatic diagnostic of abstract syntactic representation: syntactic priming (Prasad et al., 2019). If, under Ambridge’s definition, models such as these are exemplar models, then we would argue that exemplar models do not provide evidence ‘against stored abstraction.’
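The idea behind the probing technique cited above (Hewitt & Manning, 2019) can be sketched briefly. The probe posits a linear map B such that the squared distance between two transformed word vectors, ||B(h_i - h_j)||², approximates the number of edges between the two words in the syntactic tree. In the sketch below, the matrix `B` and the three-dimensional word vectors in `h` are hypothetical values picked by hand so that the arithmetic is easy to follow. In the cited work, B is instead fit to parsed data and the vectors come from a trained network.

```python
def probe_distance_sq(B, h_i, h_j):
    """Squared distance between two word vectors after the linear map B."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = [sum(row[k] * diff[k] for k in range(len(diff))) for row in B]
    return sum(p * p for p in projected)


# Hypothetical probe: keeps the first two dimensions and discards the third,
# which here stands in for non-syntactic information.
B = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# Hypothetical contextual vectors for "the dog barks".
h = {
    "the":   [0.0, 0.0, 5.0],
    "dog":   [1.0, 0.0, -2.0],
    "barks": [1.0, 1.0, 7.0],
}

# Assumed dependency tree: the <- dog <- barks, so the tree distances are
# the-dog = 1, dog-barks = 1, the-barks = 2.
d_the_dog = probe_distance_sq(B, h["the"], h["dog"])
d_dog_barks = probe_distance_sq(B, h["dog"], h["barks"])
d_the_barks = probe_distance_sq(B, h["the"], h["barks"])
```

In this toy case the squared probe distances equal the tree distances exactly, even though the raw vectors differ wildly in their third dimension. The point of the illustration is that tree structure can be linearly recoverable from a representation that was never built to store trees.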
On the other hand, if modern NLP models are not exemplar models under Ambridge’s definition – either because they do not store all the input or because they learn and store abstract structures – that would seem to create a distinction between what Ambridge, citing Chandler (2002), calls ‘de facto exemplar models’ and full-scale exemplar models. Here, ‘de facto exemplar models’ seem to be models with enough parameters to encode the full input. But, in this case, Ambridge does not state why one should prefer a pure exemplar model to this class of de facto exemplar models. Moreover, this distinction undermines the argument that a major strength of exemplar models is their computational success. If BERT and its cousins fall outside the space of what Ambridge would consider pure exemplar models, then the pure exemplar models that remain do not approach state-of-the-art performance in NLP.
More broadly, there need not be a hard split between models that encode abstract structures and those that store a huge amount of information about the input and allow for fast analogical comparisons. Neural language models in all their variations provide a gradient across these dimensions. They are not explicitly trained to operate over syntactic trees or morphological hierarchies, but their representations may still encode and store those sorts of abstractions to varying degrees depending on their architecture and training.
Situating Ambridge’s account of exemplar models within this space of modern models would help clarify what counts as an exemplar model in the first place, just how radical Ambridge’s proposal is, and how these ideas can guide future efforts to constrain the space of models for language acquisition and processing.
