Abstract
Advancements in computational capabilities have enabled the implementation of advanced deep learning models across various domains of knowledge, yet the increasing complexity and scarcity of data in specialized areas pose significant challenges. Zero-shot learning (ZSL), a subset of transfer learning, has emerged as an innovative solution to these challenges, focusing on classifying unseen categories present in the test set but absent during training. Unlike traditional methods, ZSL utilizes semantic descriptions, like attribute lists or natural language phrases, to map intermediate features from the training data to unseen categories effectively, enhancing the model’s applicability across diverse and complex domains. This review provides a concise synthesis of the advancements, methodologies, and applications in the field of zero-shot learning, highlighting the milestones achieved and possible future directions. We aim to offer insights into the contemporary developments in ZSL, serving as a comprehensive reference for researchers exploring the potentials and challenges of implementing ZSL-based methodologies in real-world scenarios.
Introduction
Over the past decade, advances in computational capabilities and the availability of large datasets have paved the way for the application of complex deep learning models across diverse domains of knowledge such as finance, education, and life sciences among others. While the increase in computational power has been significant, the challenges associated with managing increasing data size and complexity have also grown. It should also be noted that in some specialized research areas, the scarcity of data further complicates the development of effective deep learning models.
To overcome the issues mentioned above, transfer learning can be utilized as an effective strategy. This approach involves leveraging pre-trained neural networks that have been designed for a different task, rather than building and training a deep neural network from scratch. Most layers from the existing model can be retained, requiring only the upper layers to be fine-tuned to suit the specific needs of a new application. This not only speeds up the training process but also reduces the amount of data needed for effective training. In this way, transfer learning serves as both a time-efficient and data-efficient solution for the development of deep learning models [1].
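As a minimal sketch of this idea, the following pure-Python example treats a fixed function as the "pre-trained" lower layers and trains only a small logistic-regression head on a handful of labelled points. The feature map `frozen_features`, the toy data, and the hyperparameters are all hypothetical and chosen only for illustration; in practice the frozen part would be a deep network pre-trained on another task.

```python
import math

def frozen_features(x):
    """Stand-in for pre-trained lower layers (hypothetical); never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_head(data, labels, lr=0.5, epochs=200):
    """Train only the top layer (logistic regression) on the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            f = frozen_features(x)
            p = sigmoid(w[0] * f[0] + w[1] * f[1] + b)
            g = p - y  # gradient of the log-loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    f = frozen_features(x)
    return 1 if sigmoid(w[0] * f[0] + w[1] * f[1] + b) >= 0.5 else 0

# Toy target task with only four labelled samples (assumed data).
data = [(2.0, 1.0), (1.5, 0.5), (-1.0, -2.0), (-0.5, -1.5)]
labels = [1, 1, 0, 0]
w, b = fine_tune_head(data, labels)
preds = [predict(x, w, b) for x in data]
```

Only `w` and `b` change during training, which is why few samples suffice here; the quality of the result depends entirely on how well the frozen features suit the new task.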
Zero-shot learning is a specialized type of transfer learning that addresses the challenge of identifying categories in the test set that were not present during training. In this context, categories involved in training are referred to as ‘seen’, while those appearing only during testing are called ‘unseen’. Unlike traditional supervised learning, zero-shot learning has no access to samples from the ‘unseen’ categories during training. To compensate for this, semantic descriptions are provided for each category, typically in the form of attribute lists or natural language phrases. The foundational idea is to extract intermediate features from the training data that can be used to map test samples to the ‘unseen’ categories. Such intermediate features may encompass elements like color, texture, or specific object parts. Because these features are likely to exist in both seen and unseen categories, they enable the formulation of discriminative descriptions for more complex concepts, making transfer to unseen classes feasible [2].
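The attribute-based mapping described above can be sketched as follows. This is an illustrative example only: the class names, the three attributes, and the attribute scores (which stand in for the output of an attribute predictor trained on seen classes) are assumptions, not taken from any cited work.

```python
import math

# Hypothetical attribute signatures for unseen classes,
# ordered as [striped, hooved, black_and_white].
UNSEEN_CLASS_ATTRIBUTES = {
    "zebra": [1.0, 1.0, 1.0],
    "panda": [0.0, 0.0, 1.0],
    "deer": [0.0, 1.0, 0.0],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_zero_shot(predicted_attributes, class_attributes):
    """Return the class whose attribute signature is closest to the prediction."""
    return min(
        class_attributes,
        key=lambda name: euclidean(predicted_attributes, class_attributes[name]),
    )

# Assumed output of an attribute predictor that was trained on seen classes only.
attribute_scores = [0.9, 0.8, 0.7]
label = classify_zero_shot(attribute_scores, UNSEEN_CLASS_ATTRIBUTES)
```

No image of a zebra is needed at training time: the class is recognized purely because its semantic description matches the attributes predicted for the test sample.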
A structured depiction of the various zero-shot learning frameworks that are subject to analysis in this review. The diagram categorizes the frameworks into modality and attribute-based, learning strategy-based, and advanced techniques, with further subdivisions such as hybrid, generative modeling, and instance-based approaches, among others.
Top 20 most cited zero-shot learning papers over the last years
The progression in the realm of zero-shot-based machine and deep learning methodologies has been remarkable, exemplified by an array of review articles and the unveiling of many distinct tools designed to advance numerous domains of knowledge [3, 4, 5, 6, 7] as well as the papers listed in Table 1. Nevertheless, the field remains intricate, attributed to the continuous expansion in the number of knowledge domains adopting machine learning-based tools as well as to the ever-increasing complexity of datasets. Therefore, in light of these developments and challenges, our work aims to provide a comprehensive review of the contemporary developments and innovations in the field of zero-shot learning, spanning broadly over the previous decade. Through this synthesis, we aim to delineate the contemporary research landscape, highlighting both the achieved milestones and forthcoming challenges.
Due to their versatility and adaptability, zero-shot learning techniques are applied across various domains, most prominently in computer vision, to address the limitations inherent to more conventional supervised learning methodologies. Although some of the datasets used for zero-shot learning were not originally created for it, they have proven valuable for research in this field because they cover a wide range of classes and examples. Below, we explore a variety of publicly available datasets that are frequently used for experimentation and validation of zero-shot learning-based techniques.
The CUB-200-2011 (Caltech-UCSD Birds 200) dataset [27] contains images representing 200 bird species, manually annotated with class labels and bounding boxes. This dataset, sourced from the web, has been a notable benchmark for ZSL object recognition and has found applications in various domains including visual feature learning, multi-class recognition, object retrieval, attribute learning, and unsupervised domain adaptation.
Furthermore, ImageNet [28] stands out as one of the most prevalent datasets for image classification tasks, offering a diverse range of image data across 1000 classes. It has become an indispensable resource for a variety of ZSL applications such as image annotation, zero-shot object detection, and image retrieval due to its extensive and varied data offering.
In addition, the Animals with Attributes (AwA) dataset [29] is a benchmark compilation focusing on animal attributes, enriched with abstract attributes like stripes or horns. Comprising 50 animal categories with 30,475 images in total, each category annotated with 85 attributes, it serves as a versatile tool for a range of ZSL tasks including attribute-based image classification.
Similarly, the aPascal/aYahoo datasets [30], built from Pascal VOC images and images collected through Yahoo image search respectively, were constructed with ZSL in mind. The former covers 20 object classes and the latter 12 additional classes, with both annotated using a shared set of 64 attributes, enriching the spectrum of ZSL research.
Visual Genome [31] is another notable dataset, offering class labels and scene graphs for its images, and has been pivotal for ZSL applications like visual relationship detection and semantic segmentation.
SUN Attribute (SUN) [32], constructed from the SUN dataset with annotations for 717 categories, provides an alternative to ImageNet for scene classification and is an established benchmark for ZSL tasks such as image annotation.
Lastly, the NUS-WIDE dataset [33], containing images from Flickr annotated with 81 concepts, has proven to be suitable for ZSL tasks, including multi-label classification and zero-shot retrieval, widening the horizon for research in zero-shot learning methodologies.
Learning strategy-based methods
Metric learning methods
Metric learning methods have become prominent in zero-shot learning as a means of recognizing novel classes that are absent from training. These methods learn distance metrics and similarity measures that allow data points from seen and unseen classes to be compared. Within this context, Xu et al. [24] apply metric learning to the diagnosis of compound bearing faults, using it to discern complex fault patterns. Meanwhile, Huang et al. [34] introduce the Hippocampus-heuristic Character Recognition Network (HCRN), which applies metric learning principles by learning features from pairs of input samples.
Kutbi [35] introduces ZDDA, a metric learning-based algorithm that is applicable across various domains. McCartney et al. [36] demonstrate the adaptability of metric learning by integrating it into EEG-based image retrieval. Additionally, Xu [37] offers an encoding method designed for decoding individual faults within compound fault signals, illustrating the role metric learning plays in signal processing.
Also in signal processing, Dong et al. [38] present the ‘signal recognition and reconstruction convolutional neural networks (SR2CNN)’ framework, in which metric learning is combined with loss functions for signal recognition. Fu et al. [39] contribute a robust maximum margin framework for semantic manifold-based recognition. Meanwhile, Deznabi et al. [17] turn to biology with DeepKinZero, an approach designed explicitly for predicting the kinases responsible for phosphorylating specific protein sites.
In the visual domain, Ji et al. [40] develop a deep metric learning framework tailored for image zero-shot learning. Furthermore, Huang et al. [41] introduce CPDN, a deep metric learning model built for Generalized Zero-Shot Learning (GZSL), showing that metric learning can accommodate diverse zero-shot learning scenarios. Ji et al. [42] and Fu et al. [43] propose novel manifold distance metrics for visual recognition tasks. Finally, Guo et al. [44] present a one-step recognition framework for novel classes, underscoring the role of metric learning in handling data from previously unencountered categories.
Collectively, these algorithms not only underscore the versatility and potential of Metric Learning but also encapsulate the essence of progress within the realm of zero-shot learning research.
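To make the notion of a learned distance concrete, the sketch below implements a triplet margin loss, one common training objective in metric learning. It is shown here as a generic illustration and is not claimed to be the loss used by any of the works cited above; the embeddings and the margin value are assumed.

```python
import math

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

# Assumed 2-D embeddings: the first triplet is already well separated,
# the second is not and therefore incurs a penalty.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Minimizing this loss over many triplets pulls samples of the same class together and pushes other classes away, so that at test time an unseen-class sample can be assigned to the nearest class representation in the learned space.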
Classifier-based methods
Classifier-based methods are powerful tools for recognizing previously unseen classes and extend traditional recognition approaches. These techniques leverage a variety of strategies, including class-level semantics, domain-specific adaptations, and advanced classification techniques.
Lv et al. [45] introduce TSVR, which enhances classifier capabilities through semantic-visual fusion pairs and domain-specific batch normalization, illustrating the potential of fine-tuned classifiers. Yu et al. [46] present KDCGN, a framework that directly generates classifiers conditioned on class-level semantics, streamlining the training process for unseen classes.
Freitas et al. [47] apply zero-shot learning to the classification of marine materials, demonstrating its adaptability across diverse domains. Cheraghian et al. [48] pioneer a 2D zero-shot learning approach tailored for 3D point cloud classification. In doing so, they address challenges such as domain adaptation, hubness, and data bias, further expanding zero-shot learning’s applicability.
Ji et al. [49] contribute a novel method for zero-shot image classification, addressing class-imbalance issues and highlighting the adaptability of classifier-based strategies. Kim et al. [50] propose a deep attribute based on CNN features, enriching the field with discriminative and classifying properties.
Additionally, Liu et al. [51] present a co-training framework for zero-shot learning, fostering iterative knowledge transfer and strengthening classifier capabilities. Li et al. [52] explore the realm of superclasses in both feature and semantic spaces, facilitating knowledge transfer and enhancing recognition of samples from unseen classes. Hou et al. [53] introduce a discriminative comparison classifier tailored for generalized zero-shot learning tasks. Duan et al. [54] focus on Brain-Computer Interfaces (BCI) based on motor imagery for EEG signal recognition, leveraging zero-shot techniques to augment recognition capabilities. Li et al. [55] contribute an innovative technique for zero-shot generalized classification, further pushing the boundaries of classifier-based methods.
Zhang et al. [56] introduce a two-branch network designed to regress to class-level semantic embeddings, enhancing the interpretability and capabilities of classifiers within the zero-shot learning paradigm. Del et al. [57] explore artwork instance recognition through two distinct approaches, conventional recognition and zero-shot learning for unseen artwork instances. Gui et al. [58] propose a GZSL-based learning approach for PolSAR data classification that predicts labels for both known and previously unseen classes, showing that classifier-based strategies can handle complex real-world data. Cheng et al. [59] introduce a Random Forest-based zero-shot image classification approach that leverages relative attributes (RAs) to enhance image recognition. In parallel, Qin et al. [60] use a class-wise discrete descent algorithm and a multi-output neural network to predict multiple attributes from low-level features, going beyond traditional image classification by extracting rich attribute information. Complementing these works, Liu et al. [61] present a framework that simultaneously learns attribute-attribute connections and attribute classifiers, offering a holistic perspective on attribute-based classification.
These diverse contributions collectively underscore the significance of classifier-based methods in addressing intricate recognition tasks across a spectrum of domains, from art recognition to remote sensing, image classification, and beyond.
Instance-based methods
Within zero-shot learning, instance-based methods have garnered attention for their distinctive strategies. Yang et al. [62] introduce an Iterative Class Prototype Calibration technique, emphasizing the iterative refinement of class prototypes as a key strategy. Pham et al. [63] take a distinctive approach with PencilNet, a method that detects racing gates by uniting predictions into a single pose tuple, demonstrating an instance-based perspective in its operation. Zarei et al. [64] tackle domain shift and the hubness problem in zero-shot learning through the use of a learned kernel distance function and a theory-based prototype learning strategy, enhancing the adaptability of zero-shot learning models. Li et al. [65] propose the Multiple Semantic Subspaces Network, which leverages the concept of semantic subspaces for improved zero-shot learning performance. Meanwhile, Xie et al. [66] improve zero-shot learning by utilizing unseen images to train a model more effectively, with the aid of a novel training dataset called Virtual Mainstay samples.
In a semi-supervised fashion, Xu et al. [67] present the Low-Rank Semantic Grouping (LSG) model, which seeks to enhance the performance of zero-shot learning. Xie [68] contributes a Feature Enhancement Framework tailored for zero-shot learning tasks, enriching feature representations. Song et al. [69] explore the use of physics-based electromagnetic simulated images for learning the features of unseen targets within a zero-shot learning context, demonstrating the versatility of instance-based methods. Liu et al. [70] propose a Convolutional Prototype Learning framework that accounts for distribution conformity, enhancing the discriminative power of prototypes. Lv et al. [71] address bias reduction towards seen classes in zero-shot semantic segmentation, presenting a novel approach to promote fairness in recognition. Rahman et al. [72] introduce a Deep Multiple Instance Learning framework, shedding light on the potential of multiple instance learning techniques. Finally, Guo et al. [44] present a framework that incorporates transferred samples from source classes with pseudo labels and modifies the standard support vector machine formulation, offering a unique perspective on zero-shot learning.
Learning strategy-based methods
The diverse learning strategy-based methods mentioned above and listed in Table 2 collectively contribute to the evolving landscape of zero-shot learning, showcasing innovative approaches and strategies tailored for various recognition tasks and challenges.
Generative modeling
Generative modeling is a major pillar of zero-shot learning, using probabilistic modeling to overcome the challenges of recognizing unseen classes. This category hosts a diverse set of approaches, each designed to narrow the divide between known and unknown categories. Liu et al. [73] introduce the discriminative cross-aligned variational autoencoder (DCA-VAE), a model dedicated to learning the joint distribution of classes and attributes, enabling more accurate predictions.
Meanwhile, Cheng et al. [74] unveil a hybrid routing transformer tailored explicitly for zero-shot learning tasks, drawing from the transformative capabilities of the transformer architecture to enhance recognition performance. Addressing bias concerns in feature generation for unseen classes, Yang et al. [75] present the ABA-GAN, a generative adversarial network that takes a proactive stance on fairness. In parallel, Ye et al. [76] introduce LCR-GAN, a GAN-based method that aligns the distributions of visual features and semantic attributes, thereby enriching zero-shot learning’s potential. Gao et al. [77] propose a bidirectional generative network fortified with cycle consistency, effectively bridging the chasm between visual and semantic domains. Li et al. [78] bring forth the AMAZ attribute-modulated generative meta-model, offering novel avenues for leveraging generative capabilities in zero-shot learning endeavors. In the pursuit of learning enhancement, Wei et al. [79] leverage generative replay techniques to augment the learning process. Tang et al. [80] contribute to this landscape by introducing a dedicated GAN structure tailored for zero-shot learning, enhancing the synthesis of visual features for unseen classes. In a parallel development, Liu et al. [81] delve into the domain of WGAN-based sample synthesis, harnessing the power of Generative Adversarial Networks (GANs) to create samples that bridge the gap between known and unknown categories. Mahapatra et al. [82] take a self-supervised learning approach and employ GradCAM saliency maps to synthesize features for unseen classes, showcasing the versatility of generative models.
Exploring the bidirectional connection between visual and semantic spaces, Li et al. [83] introduce Boomerang-GAN, a model that outperforms previous approaches in recognition and segmentation tasks. Guo et al. [84] employ meta-learning to generate fake visual features, effectively addressing domain bias issues with their CMPN model, while Xie et al. [85] present MGA-GAN, a Generative Adversarial Network tailored for generalized zero-shot learning. Gull et al. [25] advance the field with iVAE, a model based on the Variational Autoencoder (VAE) that excels in zero-shot learning tasks. Ma et al. [86] introduce GAN-MVAE, a fusion of a generative adversarial network and a multi-modal variational autoencoder, paving the way for generalized zero-shot learning. Shinzaki et al. [87] explore robust adversarial reinforcement learning techniques to tackle zero-shot adaptation in beam-tracking, demonstrating the adaptability of generative models across diverse applications within zero-shot learning.
Generative models continue to play a pivotal role in expanding the boundaries of zero-shot learning. Liu et al. [88] employ a cascade Generative Adversarial Networks (GANs) strategy for feature generation, enriching the model’s capacity for zero-shot tasks. Chen et al. [89] introduce a novel flow-based generative framework tailored for Generalized Zero-Shot Learning (GZSL), setting the stage for enhanced feature synthesis. Shermin et al. [90] approach GZSL with the bidirectional mapping coupled generative adversarial network (BMCoGAN), leveraging bidirectional mappings to advance feature synthesis capabilities. Deng et al. [91] propose the Quality-Verifying Adversarial Network (QVAN), augmented with an l12 constraint, to improve feature synthesis quality. Ye et al. [92] tackle the persistent domain shift challenge in GZSL tasks through their Discriminative Learning GAN, aligning distributions to enhance feature generation. Li et al. [93] propose the Augmented Semantic Feature Based Generative Network (ASFGN), dedicated to the synthesis of visual features for unseen classes. Luo et al. [94] present a Dual VAEGAN framework, unifying Variational Autoencoders (VAEs) and GANs, producing clear visual features for zero-shot learning.
Xie et al. [95] unveil a Generative Network-Based approach that leverages semantic features as input to synthesize visual features as output, bridging the gap between domains. Guo et al. [96] introduce a zero-shot augmentation learning model (ZSAL) that collaborates with medical professionals to generate virtual images for the computer-aided diagnosis of rare diseases. Feng et al. [97] pioneer a Dual-knowledge-source-based generative model, while Liu et al. [98] introduce the Cross-class generative network. Li et al. [99] contribute to the landscape with a GAN-based ZSL approach, and Song et al. [100] unveil the Domain-aware Stacked AutoEncoder (DaSAE), a model built on two interactive stacked auto-encoders for domain-aware projections.
Continuing the narrative of generative models in zero-shot learning, Ponti et al. [101] present a Bayesian generative model tailored for neural parameters within unseen task-language combinations. This innovation opens doors to more intricate and nuanced learning scenarios. Geng et al. [102] introduce a knowledge graph-based framework for Zero-Shot Learning (ZSL), featuring an attentive Zero-Shot Learner (AGCN) and an explanation generator. This model taps into the rich resource of knowledge graphs to enhance the learning process. Wang et al. [103] address the challenging zero-shot domain adaptation problem by developing a Conditional Coupled Generative Adversarial Network (CoCoGAN), leveraging generative capabilities to adapt to new domains seamlessly. Kim et al. [104] present the Zero-Shot Generative Adversarial Network (ZSGAN), a model designed to tackle the challenges posed by data imbalance, particularly pertinent in zero-shot scenarios. Ma et al. [105] propose the similarity-preserving GAN (SPGAN) to generate visual features for unseen classes while preserving the similarity relationships within the data. Liu et al. [106] advance the field with a dual-stream GAN, designed to excel in zero-shot visual classification tasks. Chi et al. [107] introduce the Dual Adversarial Distribution Network (DADN), specially crafted for zero-shot cross-media retrieval, showcasing the versatility of generative models in diverse applications. Gao et al. [108] contribute a zero-shot learning method based on contractive stacked autoencoders, providing a unique approach to feature generation. Shao et al. [109] propose a multi-channel Gaussian Mixture VAE model that excels in generalized zero-shot learning tasks, leveraging the power of Gaussian Mixture models. Gao [110] introduces Zero-VAE-GAN, while Ding et al. [111] develop a two-stage generative adversarial network tailored specifically for zero-shot learning.
These models collectively push the boundaries of what is possible in zero-shot learning through generative prowess.
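The common pattern behind many of the feature-generating approaches above (synthesize visual features for unseen classes from their semantic descriptions, then classify with them) can be sketched as follows. This is a deliberately simplified illustration: the "generator" is an assumed fixed linear map plus Gaussian noise rather than a trained GAN or VAE, and the class names, attribute vectors, and weights are hypothetical.

```python
import random

def generate_features(attributes, n_samples, rng, noise=0.1):
    """Hypothetical generator: fixed linear map from attributes to features plus noise.
    A trained GAN generator or VAE decoder would take this role in practice."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # assumed: 3 feature dims x 2 attribute dims
    return [
        [sum(w * a for w, a in zip(row, attributes)) + rng.gauss(0.0, noise) for row in weights]
        for _ in range(n_samples)
    ]

def centroid(samples):
    """Mean feature vector of a list of samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def nearest_centroid(x, centroids):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda name: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[name])),
    )

rng = random.Random(0)  # fixed seed so the sketch is reproducible
unseen_attributes = {"class_a": [1.0, 0.0], "class_b": [0.0, 1.0]}

# Synthesize features for classes that have no real training images,
# then build a simple classifier from the synthetic samples.
centroids = {
    name: centroid(generate_features(attrs, 50, rng))
    for name, attrs in unseen_attributes.items()
}
prediction = nearest_centroid([0.9, 0.1, 0.5], centroids)
```

Once synthetic features exist, the zero-shot problem reduces to ordinary supervised classification, which is the main appeal of the generative route.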
Hybrid methods
A cluster of hybrid methods has emerged, each fusing different techniques and paradigms to address the challenges presented by unseen classes and data scarcity. Ji et al. (2021) introduce the UPL method, which applies two constraints (an autoencoder and a triplet loss) within the episodic training paradigm and is applicable in both conventional ZSL and generalized ZSL (GZSL) settings [112]. In parallel, Zhang et al. (2022) propose an SMDM-based approach that infers unseen relations from seen relations using semantic metrics generated by BERT [113]. Yao et al. (2023) introduce GhostShuffleNet (GSNet), a specialized framework for the recognition of Unmanned Aerial Vehicle (UAV) images. GSNet combines a zero-shot Neural Architecture Search (NAS) algorithm with other pertinent features, illustrating the value of domain-specific optimizations for ZSL systems operating in niche domains such as UAV imagery [114].
Li et al. (2023) present BGSNet, a two-branch, end-to-end network that balances generalization and specialization capabilities, operating at both the instance and dataset levels. This approach underscores the importance of balance between these two facets of ZSL for recognition accuracy across diverse datasets and instances [115]. Hu et al. (2022) present a hybrid approach that uses both a feature-attribute embedding model and a generative feature model to bridge the gap between visual and semantic domains [116]. In parallel, Ao et al. (2022) introduce a cross-modal prototype learning method that integrates knowledge from both textual and visual modalities to enhance zero-shot learning performance [117]. Dong et al. (2022) propose a GZSL method that utilizes two statistical techniques to establish boundaries between domains, facilitating knowledge transfer between seen and unseen classes [22]. Li et al. (2022) contribute the ERPCNet, an effective, efficient, and explainable model that transfers knowledge from observed to unseen classes in both ZSL and GZSL settings [118]. Liu et al. (2022) introduce a semantics-guided spatial attention mechanism and learn discriminative prototypes for each class [119]. Yun et al. (2022) present SALN, which employs an
Bian et al. (2022) leverage cross-modality information and relation prototypes, deploying them effectively for classifying previously unseen medical images [122]. Meanwhile, Li et al. (2022) enhance classification results through a fusion of a semantic embedding network and an auxiliary classifier [123]. Song et al. (2022) introduce the Semantic-Visual Combination Propagation Network (CPN), which seamlessly combines semantic and visual representations while incorporating an auto-encoder to bridge the gap between these domains [124]. Zhang et al. (2022) contribute Cluster-Prototype Matching (CPM), harnessing sample distribution information and the Kuhn-Munkres algorithm to match clusters with class prototypes, thereby improving zero-shot classification [125]. Lu et al. (2022) propose a GZSL meta-learning approach that leverages class-level semantic knowledge and employs an entropy gate approach to tackle complex recognition tasks [126].
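The cluster-to-prototype matching step mentioned for CPM is an assignment problem. The sketch below illustrates it with an exhaustive search over permutations, which returns the same minimum-cost one-to-one assignment that the Kuhn-Munkres algorithm would find but is only practical for a handful of classes. The cluster centers, prototype vectors, class names, and the squared-distance cost are assumptions for illustration and are not taken from the cited work.

```python
from itertools import permutations

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def match_clusters_to_prototypes(cluster_centers, prototypes):
    """Return, for each cluster, the class name of its matched prototype.
    Exhaustive stand-in for Kuhn-Munkres; assumes equally many clusters and prototypes."""
    names = list(prototypes)
    best_cost, best_assignment = float("inf"), None
    for candidate in permutations(names):
        cost = sum(
            squared_distance(center, prototypes[name])
            for center, name in zip(cluster_centers, candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return list(best_assignment)

# Assumed cluster centers of unlabeled test features and projected class prototypes.
cluster_centers = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
prototypes = {"cat": [0.0, 1.0], "dog": [1.0, 0.0], "fox": [0.6, 0.6]}
assignment = match_clusters_to_prototypes(cluster_centers, prototypes)
```

For realistic numbers of classes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice instead of enumerating permutations.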
Shermin et al. (2022) introduce an integrated network that employs two sub-networks for the EL and FS categories of methods. It utilizes mutual learning and mutual information, exemplifying the integration of diverse techniques in hybrid zero-shot learning [127]. Liu et al. (2022) present AREES, a comprehensive approach that combines an attention mechanism, a decomposition structure, and a multimodal VAE, demonstrating its hybrid nature [128]. Li et al. (2022) contribute TUPL, designed specifically for the GZSDA challenge, showcasing its adaptability to complex zero-shot scenarios [129]. Chen et al. (2022) introduce GNDAN, which incorporates RAN and RGAT to generate both global and local embeddings, effectively addressing challenges in zero-shot learning [130]. Kwon et al. (2022) employ a two-stream autoencoder-based gating model, a hybrid approach focusing on feature generation and efficiency [131]. In parallel, Xu et al. (2022) combine generative mixup networks with semantic graph alignment and a triplet gradient matching loss, exemplifying the fusion of generative and discriminative methods for improved performance [132]. Jia et al. (2022) explore active learning for a visual explainable approach, adding another dimension to the hybrid landscape [133]. Lastly, Ye et al. (2022) employ triplet loss for ZSL image classification, effectively harnessing the power of generative adversarial networks in their approach [134].
Yao et al. (2022) propose an attribute-induced bias-eliminating (AIBE) module and an attention graph attribute embedding process, aimed at eliminating biases and improving attribute-based recognition [135]. Li et al. (2021) present Locality-Preservation Deep Cross-Modal Embedding Networks (LPDCMENs), an end-to-end method tailored for zero-shot remote sensing scene classification [136]. Liu et al. (2021) introduce an adversarial strategy involving a projector and classifier for unseen object recognition [137]. Zhang et al. (2021) contribute an encoder-decoder framework with an attention mechanism [138]. Nihal et al. (2021) leverage the Linear Discriminant Analysis (LDA) classifier and DenseNet101 for Bangla sign language recognition, showcasing the versatility of zero-shot learning across domains [139].
Qian et al. (2021) propose a Cross-Domain Lifelong Reinforcement Algorithm with Zero-Shot Policy Generation (CDLRL-ZPG), highlighting the potential of reinforcement learning in zero-shot settings [140]. Xie et al. (2021) present a GAN-CST-based approach incorporating Class Knowledge Overlay (CKO), semi-supervised learning, and a triplet loss, demonstrating the power of combining multiple techniques [20]. Xu et al. (2021) introduce complementary attributes and rank aggregation as a supplement to existing methods, exemplifying a collaborative approach to zero-shot learning [141]. Min et al. (2021) contribute the Domain-Oriented Semantic Embedding (DOSE) network, a domain-specific approach that incorporates specialized sub-projections and a cycle consistency approach [142]. Ding et al. (2021) utilize a latent space and two domain classifiers for both ZSL and supervised classification tasks, showcasing the potential for hybrid methods [143].
In a distinctive approach, Wen et al. (2020) introduce a ZSL-based method rooted in Traditional Chinese Medicine concepts, bridging the gap between ancient wisdom and modern AI [14]. Zhang et al. (2020) present a deep learning architecture that addresses domain shift problems in GZSL through a KL Divergence constraint, exemplifying the use of constraints in zero-shot learning [144]. Wang et al. (2020) propose Deep Attribute Prediction (DeepAP), a model that leverages a class-attribute matrix to explore attribute-class correlations and incorporates weighted attributes for zero-shot image classification [145]. Zhang et al. (2020) put forward a hierarchical prototype learning approach (HPL) for zero-shot recognition, leveraging class prototypes and semantic spaces to differentiate between seen and unseen classes [146]. Li et al. (2020) contribute a zero-shot learning procedure that maintains semantic consistency between visual and semantic spaces while learning class prototypes, demonstrating the significance of semantic alignment [147].
Zhang et al. (2019) introduce a probabilistic model with triplet learning and Non-Negative Matrix Factorization (NMF), illustrating the integration of probabilistic methods and traditional machine learning techniques [148]. Ji et al. (2020) innovate with an adversarial feature fusion network that fuses different class semantic prototypes to generate pseudo visual features, highlighting the power of feature fusion [149].
Liu et al. (2020) contribute to Explainable Zero-Shot Learning (XZSL) with a novel vision-attribute embedding module and a multi-channel explanation model, shedding light on the interpretability of ZSL systems [150]. Changpinyo et al. (2019) propose two innovative frameworks for ZSL using manifold embeddings and synthesized “exemplars,” expanding the repertoire of techniques available for handling the complexities of unseen class recognition [151]. Jia et al. (2020) introduce the DUET model, comprising a Deep Embedding Transfer (DET) module and an Unseen Visual Feature Generation (UVG) module, pushing the boundaries of feature transfer and visual feature synthesis in ZSL [152].
Liu et al. (2020) present the Label-Activating Framework (LAF) through Indirect Attribute Prediction (IAP) for Generalized Zero-Shot Learning (GZSL), emphasizing the role of attribute predictions in improving recognition across seen and unseen classes [21]. Ding et al. (2019) propose the Cross-Domain Mapping (CDM) model, addressing the domain shift problem in ZSL by mapping visual features to a common domain, showcasing the importance of domain adaptation [153]. Jiang et al. (2019) put forward a novel ZSL method leveraging class similarities to adjust the visual-semantic embedding for unseen classes, highlighting the value of semantic alignment [154].
Ji et al. (2019) introduce a synthesized approach based on dictionary learning, merging traditional machine learning techniques with ZSL concepts [155]. Zhang et al. (2019) present a hybrid approach involving random attribute selection and conditional GAN, demonstrating the potential for combining various strategies to enhance ZSL [156]. Zhang et al. (2019) further contribute with a dual-verification network for zero-shot classification, highlighting the importance of verifying both feature and attribute spaces for accurate recognition [157].
Yu et al. (2018) propose ASTE and SPASS techniques to improve the accuracy of unseen class recognition, along with a fast training (FT) strategy to enhance classification efficiency, reflecting their dedication to practical advancements in ZSL [158]. Liu et al. (2018) introduce CORL, a fusion of ontology and reinforcement learning, to construct classification rules based on attribute annotations, underscoring the fusion of knowledge-driven and data-driven approaches [159].
Sumbul et al. (2018) use image features acquired through a CNN and additional information from manually selected attributes, a natural language model, and a scientific taxonomy for the identification of street trees in aerial data, showcasing the versatility of ZSL in diverse domains [160]. Song et al. (2017) propose a deep neural network architecture consisting of a generator and an interpreter to tackle the issue of limited training samples for Automatic Target Recognition (ATR) of Synthetic Aperture Radar (SAR) [161]. Yu et al. (2017) introduce the Regularized Cross-Modality Ranking (ReCMR) approach, emphasizing the exploration of relationships between different modalities through hinge ranking loss and regularizers [162].
Generative modelling and hybrid ZSL methods
Ji et al. (2017) focus on manifold constraints and domain adaptation for knowledge transfer with MCME-DA, exemplifying the significance of domain alignment and constraints [163]. Fu et al. (2015) propose an approach leveraging transductive multi-view embedding and heterogeneous multi-view hypergraph label propagation for effective zero-shot recognition, illustrating the potential of multi-view and transductive learning techniques [164]. These methods, which are also listed in Table 3, continue to shape and advance the field of ZSL, offering a wide range of approaches to recognizing unseen classes across diverse domains and applications.
Multi-modal ZSL
In the realm of Multi-modal Zero-Shot Learning (MZSL), researchers have introduced innovative approaches to tackle the challenges associated with leveraging multiple data modalities for enhanced recognition performance. Cao et al. (2022) propose a Multi-modal feature fusion model designed to excel in supervised learning tasks. This model showcases the significance of fusing information from diverse modalities to bolster recognition accuracy across various domains [165].
Chen (2022) addresses a crucial issue encountered in Generalized Zero-Shot Learning (GZSL) by introducing MM-APANN. This model focuses on mitigating incongruence between visual features and semantic attributes, contributing to more effective recognition in multi-modal settings [166].
Additionally, exploring the broader implications of Multi-modal ZSL, researchers have delved into domains such as robotics. A notable example is the work by Lázaro-Gredilla et al. (2019), which discusses a comprehensive framework for aiding robots in interpreting high-level concepts. This framework incorporates principles from mental imagery and other pertinent sources, signifying the importance of multi-modal approaches in enabling robots to comprehend and interact with their surroundings [167].
Multilabel MZSL
Graph Convolution Networks have proven to be a valuable tool in addressing the complexities of Multi-label Zero-Shot Learning (MZSL). Ou et al. (2020) present a Graph Convolution Network-based MZSL model, highlighting the potential of graph-based techniques to facilitate the recognition of multiple labels associated with a single instance. This approach demonstrates the importance of leveraging graph structures to enhance the performance of MZSL systems, particularly when dealing with multi-label scenarios [168].
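To make the graph-based idea concrete, the sketch below shows a single generic graph-convolution step over a small label graph, using the commonly used propagation rule ReLU(D^{-1/2}(A + I)D^{-1/2} H W). It is not the architecture of Ou et al.; the label graph, the embeddings, and the identity weight matrix are hypothetical values chosen so the result is easy to follow.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, i.e. add self-loops, then normalize."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, x) for x in row] for row in propagated]


# Hypothetical label graph: labels 0 and 1 are related, label 2 is isolated.
adjacency = [[0, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
label_embeddings = [[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]]
identity_weights = [[1.0, 0.0],
                    [0.0, 1.0]]

updated = gcn_layer(adjacency, label_embeddings, identity_weights)
```

With identity weights, the two connected labels end up sharing the average of their embeddings, while the isolated label keeps its own. This is the mechanism by which related labels exchange information in a multi-label setting.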
Graph-based methods
Graph-based approaches have become instrumental in advancing attribute-based zero-shot learning, exploiting the relationships among attributes to build more effective ZSL models. One example is MR-Selection, introduced by Feng et al. (2023), a zero-shot band selection approach for hyperspectral image (HSI) classification that leverages dynamic structure-aware graph convolutional networks [169].
Roy et al. (2022) have also made a substantial contribution by devising a graph convolution network-based autoencoder that generates commonsense embeddings. This innovative approach enhances the interpretability of visual data and bolsters ZSL models’ performance [19].
The realm of graph-based techniques continues to evolve with Xu et al.’s (2022) introduction of Poincaré graph convolutional networks, pushing the boundaries of graph-based methods and their application in ZSL [170]. Furthermore, Wang et al. (2022) have unveiled LND-GMF, a methodology featuring a neighborhood-based gating system. This approach represents a significant stride in improving attribute-based ZSL models [171].
Mancini et al. (2022) have enriched this landscape by presenting Compositional Cosine Graph Embedding, a potent technique for effectively capturing attribute relationships [172]. Additionally, Gao et al. (2021) have introduced the Prototype-Sample Graph Neural Network (PS-GNN), specifically designed for video ZSL, underscoring the versatility of graph-based ZSL methods [173].
Moreover, Wang et al. (2021) present GAZL, an active learning approach tailored for graph convolutional networks (GCNs) and applied to zero-shot image classification [174]. Collectively, these graph-based approaches point towards more robust and efficient ZSL models by exploring and exploiting the relationships among attributes.
Embedding-based methods
Within the domain of Embedding-Based Zero-Shot Learning (ZSL), innovative approaches are reshaping the landscape of attribute-driven models, offering novel perspectives on feature representations and semantic embeddings. Rao et al. (2023) introduce the Dual Projective model, strategically designed to navigate the challenges of generalized ZSL by adeptly combining both feature and semantic spaces, thereby enhancing model performance in diverse scenarios [8]. Meanwhile, Han et al. (2022) delve into the realm of semantic contrastive embedding, unveiling a technique that leverages the power of semantic information to refine feature representations, ultimately contributing to the improved effectiveness of ZSL models [175].
The intriguing convergence of graph-based methodologies with embedding-based techniques is exemplified by Hu et al.’s (2022) work on RIAEm, a method that incorporates a region graph network and attribute feature embedding. This hybrid approach offers a promising avenue for exploring the synergy between graph structures and embeddings in ZSL [176]. Barros et al. (2022) break new ground with Malware-SMELL, a ZSL model rooted in an S-Space representation, catering to the unique challenges of malware classification through embedding-based techniques [177].
In the field of bioinformatics, Kulmanov et al. (2022) present DeepGOZero, a pioneering application of ontology embeddings for more accurate predictions of proteins’ functions, showcasing the wide-ranging impact of embedding-based methods [178]. Moreover, Liu et al. (2022) introduce LSA and SRSA, innovative approaches aimed at addressing the projection domain shift problem, enhancing the adaptability of zero-shot learning models through embeddings [179]. Yang et al. (2021) introduce an embedding-based ZSL model with a self-focus mechanism, shedding light on the significance of attention mechanisms in embedding-based approaches [180].
An entirely different perspective on embedding-based ZSL is offered by Buckchash et al. (2021) with their Zero-Shot Visual Anomaly Recognition (VAR) approach. It operates on raw image frames drawn over a Grassmann product space, offering a distinct take on anomaly detection and classification [26]. Furthermore, Lin et al. (2021) propose Class Label Autoencoder with Structure Refinement (CLASR), a ZSL approach that adapts to multi-semantic embedding spaces, highlighting the importance of structured embeddings in ZSL [181]. Wang et al. (2021) propose a group-based attribute/object collaborative learning model that employs a structured sparse method to constrain model parameters, thereby enhancing the efficacy of zero-shot learning in complex scenarios [182]. Guo et al. (2021) present AMS-SFE, a method that employs a shared semantic feature space and an autoencoder-based expansion of semantic features to bridge the gap between seen and unseen classes, demonstrating the power of collaborative embeddings in ZSL [183].
In a different vein, Yu et al. (2019) introduce the latent space encoding (LSE), an encoder-decoder approach that unfolds new dimensions in zero-shot learning. LSE utilizes latent spaces to create discriminative embeddings, redefining the way feature representations are employed in ZSL [184]. Long et al. (2018) tackle the challenge of data synthesis for unseen classes with their UVDS approach, providing a robust framework for generating previously unseen data instances, thereby extending ZSL to uncharted territories [185]. Jiang et al. (2019) propose a novel perspective on zero-shot learning by leveraging class similarities to adjust visual-semantic embeddings. This approach offers a flexible means of tailoring embeddings to the intricacies of different ZSL scenarios [186].
Jin et al. (2019) delve into the intricacies of embedding-based ZSL, incorporating center loss and a varying learning rate to enhance feature discrimination and the overall learning process, underscoring the importance of optimization strategies in ZSL [18]. Shen et al. (2019) embark on a journey into binary embedding-based zero-shot learning, exploring the potential of binary representations to capture complex semantic relationships, providing new insights into encoding semantics in ZSL [187]. Meng et al. (2019) pioneer a new framework for ZSL, collaboratively learning a latent subspace and cross-modal embedding, illustrating how a fusion of modalities and latent representations can enrich the ZSL process [188]. Niu et al. (2019) introduce an adaptive approach to visual-semantic mapping, accompanied by progressive label refinement for ZSL. Their scalable version, DEEP AEZSL, empowers ZSL models to adapt and refine their semantic mappings in a data-driven manner, addressing challenges related to evolving data distributions [189].
Rahman et al. (2018) shed light on Class Adapting Principal Directions (CAPDs), a novel method for mapping image features to semantically meaningful spaces, offering a fresh perspective on feature representations in ZSL [190]. Meng et al. (2018) introduce a low-rank-representation (LRR) based manifold-regularization approach, which seamlessly incorporates locality and similarity information to foster the learning of discriminative semantic representations, showcasing the versatility of embedding-based techniques [191]. Long et al. (2018) propose a comprehensive ZSL framework that effectively maps semantic embeddings to a discriminative representation space, integrating KLDA, CLN, and KRR to create a powerful tool for zero-shot learning across diverse domains [192]. Lastly, Ji et al. (2017) embark on a journey of fusion, combining various types of side information and visual features into a shared semantic space, revealing a holistic approach to embedding-based ZSL, where diverse sources of knowledge converge [193].
Modality and attribute-based methods
These modality and attribute-based methods, which are also listed in Table 4, collectively redefine the boundaries of ZSL, demonstrating the power of semantic embeddings and feature representations in addressing the intricacies of real-world applications across diverse domains.
Attention mechanisms
Attention mechanisms have brought a marked shift to advanced zero-shot learning (ZSL) techniques. They have become important tools for analyzing both visual and semantic data, enhancing the interpretability and overall performance of ZSL models.
In a recent innovation by Xie et al. (2023), an end-to-end attention-based embedding network takes center stage, dedicated to uncovering the most salient image components for ZSL. This approach offers the dual advantage of precisely localizing relevant image regions and extracting discriminative features essential for accurate ZSL [3].
Yang et al. (2022) present the Spatial Response Attention (SRA) model, which leverages spatial attention localization. This model takes a step further by introducing a novel Attribute Attention Cross Entropy loss, refining the alignment between visual features and semantic attributes with unprecedented precision [194]. Meng et al. (2022) harness the potential of attention mechanisms to focus on pivotal image parts essential for distinguishing categories. This innovative approach refines the process of feature extraction, further elevating the ZSL performance by concentrating on the most informative regions of the visual data [195].
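The common thread in these approaches is that image regions are weighted by their relevance before features are pooled. The sketch below illustrates this with a generic dot-product attention over region features. It is not the SRA model or any other cited architecture, and the region features and query vector are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(region_features, query):
    """Weight each region by softmax(region . query), then sum the regions."""
    scores = [sum(f * q for f, q in zip(region, query))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    pooled = [sum(w * region[d] for w, region in zip(weights, region_features))
              for d in range(dim)]
    return weights, pooled


# Hypothetical features for three image regions and one attribute query.
regions = [[1.0, 0.0],   # region that matches the query
           [0.0, 1.0],   # unrelated region
           [0.0, 0.0]]   # background
query = [2.0, 0.0]

weights, pooled = attention_pool(regions, query)
```

The region aligned with the query receives the largest weight, so the pooled feature is dominated by the most informative part of the image.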
In another avenue of research, Zhu et al. (2022) combine attention mechanisms with Kullback-Leibler divergence, forging a powerful synergy between interpretability and information theoretic principles. This amalgamation enriches the understanding of the underlying data distribution, paving the way for enhanced ZSL [9].
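For reference, the Kullback-Leibler divergence between two discrete distributions p and q is the sum over i of p_i log(p_i / q_i). The sketch below computes this quantity only. How it is combined with attention in the cited work is not specified here, and the example distributions and the `eps` floor are illustrative assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to
    avoid division by zero (an implementation convenience, not part of
    the definition).
    """
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric, so KL(p || q) and KL(q || p) generally differ.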
Similarly, Liu et al. (2021) introduce the Semantic-diversity Transfer Network, a multi-attention architecture that emphasizes both the richness of semantic information and the integration of diverse attention mechanisms [196]. Together, these advances underscore the role of attention mechanisms in shaping how ZSL models interpret visual and semantic information, improving both performance and interpretability in real-world ZSL applications.
Other
Other advanced ZSL methods
The field of zero-shot learning expands beyond traditional applications, with tools and techniques emerging to tackle a diverse range of challenges (Table 5). In recent developments, a Deep Attention Relation Network (DARN) has been developed to enhance bearing fault diagnosis (BFD), paving the way for improved machinery reliability [197]. Similarly, a vision-based system has been proposed to assess the productivity of excavators engaged in earthmoving tasks, supporting industrial efficiency and monitoring [15].
Moreover, advancements extend to the core architecture of zero-shot learning models. ZeroNAS, a specialized neural architecture, stands out as a game-changer, surpassing conventional methods and opening doors to new possibilities in model design [198]. On the frontier of continual learning, Tf-GCZSL introduces task-free generalized continual zero-shot learning, revolutionizing how machines adapt and acquire knowledge over time [199]. The realm of meta-learning leaves an indelible mark on zero-shot learning as well, with innovative models demonstrating the capacity to learn and generalize from limited labeled data [200].
In the realm of robustness, focus has sharpened on the resilience of discriminative ZSL models to image corruptions, shedding light on the model’s reliability under challenging real-world conditions [201]. Meanwhile, NucNormZSL leverages nuclear norm to engineer a low-rank solution within the source domain and applies regularization in the target domain, exemplifying innovative domain adaptation techniques [202].
Expanding into the domain of semantic segmentation, a groundbreaking meta-learning-based model redefines zero-shot semantic segmentation, addressing complex challenges in scene understanding [203].
Zhao and colleagues [204] put forward ZSL-CPLSR, an approach designed to tackle the challenges tied to recognizing previously unseen classes. Expanding into the domain of natural language processing, Pamungkas et al. [205] investigate hate speech detection in low-resource languages. Their work revolves around joint-learning models that harness diverse bilingual language representations.
Furthering the sophistication of zero-shot learning systems, Wan [206] introduces a visual structure constraint applied to category centers, which strengthens the projection of unseen semantic centers. Kim [207] proposes a self-supervision technique tailored for vector-level CNN features that improves zero-shot learning performance. Meanwhile, Zhang and colleagues [208] present the Generic Plug-in Attribute Correction (GPAC) module, which enhances existing models for Generalized Zero-Shot Learning (GZSL) by preserving the semantic meaning of attributes. Shifting focus to under-constrained ZSL problems, Han [209] introduces the Inf-FG framework, which employs two parallel streams. Wu [210] details an algorithm founded on the Gauss-Seidel iteration and the Barzilai-Borwein stepsize, which helps reduce domain shift and mitigate information loss, further enhancing the robustness of zero-shot learning models.
Yang [211] introduces SELAR, a focused approach aimed at enhancing the performance of Generalized Zero-Shot Learning (GZSL). This method contributes significantly to the advancement of GZSL capabilities.
Turning to the domain of image indexing and retrieval, Kan [212] presents the LTI-ST system. This novel system prioritizes efficiency and scalability, marking a noteworthy contribution to the field. Zhang and colleagues [213] propose the PSD method, designed explicitly to address challenges related to Discriminative Classifiers (DCNs) in the context of generalized ZSL.
Cross-modal transfer between visual and tactile modalities is a unique challenge addressed by Liu [214], who introduces a structured approach employing dictionary learning. On the front of feature extraction, Luo [215] pioneers a feature extractor tailored to datasets. This innovative technique mitigates the issue of feature mismatch when utilizing pre-trained neural networks for Zero-Shot Learning.
Meanwhile, Wang [216] unveils DDIP, a novel method strategically designed to tackle knowledge transfer challenges between distinct classes and domains in both Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL). In the domain of action recognition, Mishra [217] explores unsupervised methods to overcome limitations associated with supervised techniques.
Pradhan [218] shifts focus to land cover mapping in Malaysia, leveraging high-resolution orthophotos to explore the applications of Zero-Shot Learning. Tang [219] proposes a noise-contrastive estimation method that contributes to the transfer of knowledge from seen categories to unseen ones. Rostami [220] introduces coupled dictionary learning as an approach to lifelong learning within the context of zero-shot learning.
Wang et al. [221] lead the charge by proposing a deep learning framework that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) for tactile material recognition, promising advancements in understanding materials through touch. Building on this progress, Liu et al. [222] introduce a pioneering generalized zero-shot learning approach that leverages web video knowledge to detect anomalous activities in surveillance videos, potentially revolutionizing the field of security and anomaly detection.
In tandem, Zhang et al. [223] present an optimization approach designed to address the challenges posed by Generalized Zero-Shot Learning (GZSL). They frame GZSL as a triple verification problem and employ complementary losses, enhancing the robustness of GZSL models. Li et al. [224] delve into the realm of hubness issues and domain shifts within Zero-Shot Learning with their LUVP method, aiming to improve the reliability and accuracy of ZSL models. Yu et al. [225] break new ground with their novel ZSL approach for computer vision, harnessing bidirectional mapping-based semantic relationship modeling to reshape the way machines perceive visual concepts and their intricate relationships.
Advancing robotics, Abderrahmane et al. [226] put forth an optimized Zero-Shot Learning algorithm for haptic recognition, enabling a robot hand to recognize novel objects without prior training data. Expanding the scope, Zhang et al. [227] describe a deep semi-supervised method that uses descriptive texts instead of labels to obtain more accurate semantics of the categories. Addressing domain shift and hubness problems, Luo et al. [228] formulate Zero-Shot Learning as attribute regression, offering novel insights into mitigating these challenges.
In an era where data are simultaneously expanding and becoming more intricate, effectively training deep learning models has become an extremely complex task, especially in domains that lack large datasets. Transfer learning emerges as a pivotal strategy in this context, providing a pathway to harness pre-existing neural networks for new tasks, thereby economizing on data and computational resources. As a category of transfer learning, zero-shot learning (ZSL) tackles the challenge of identifying and categorizing ‘unseen’ data during testing, leveraging semantic descriptions and intermediate features extracted from ‘seen’ training data. Together, transfer learning and ZSL are indispensable tools for tackling the problems inherent to vast or sparse datasets, and they hold the promise of enabling effective model development across various knowledge domains.
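The core mechanism summarized above can be sketched in a few lines: an image is first mapped to a vector of predicted attributes, which is then matched against the semantic description of each unseen class. The example below assumes that the attribute predictor already exists and only shows the matching step with cosine similarity. The attribute names, class signatures, and predicted values are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_unseen(predicted_attributes, class_signatures):
    """Return the unseen class whose attribute signature is most similar."""
    return max(class_signatures,
               key=lambda name: cosine(predicted_attributes,
                                       class_signatures[name]))


# Hypothetical attribute order: [has_stripes, has_hooves, lives_in_water].
unseen_classes = {
    "zebra":   [1.0, 1.0, 0.0],
    "dolphin": [0.0, 0.0, 1.0],
}

# Assumed output of an attribute predictor trained only on seen classes.
predicted = [0.9, 0.8, 0.1]

label = classify_unseen(predicted, unseen_classes)
```

No zebra image is needed at training time: the class is recognized purely because its semantic description matches the attributes predicted for the test image.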
Comparative evaluation of different categories of zero-shot learning algorithms involves a comprehensive analysis of their performance, scalability, generalization capabilities, and robustness across diverse datasets and domains. Semantic-based methods excel in capturing high-level semantic relationships but may face challenges in handling fine-grained distinctions and noisy data. Embedding-based approaches offer flexibility and scalability by directly learning feature representations from data, but their performance may be limited by the quality and diversity of training data. Hybrid methods aim to leverage the advantages of both semantic and embedding information, providing a more comprehensive understanding of classes. They often achieve improved generalization performance by integrating semantic knowledge into the embedding space. Meta-learning methods offer a promising avenue for zero-shot learning by learning to adapt to new classes with limited labeled data, but their applicability may be constrained by computational complexity and the availability of meta-training data.
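The text above does not name a specific metric. One commonly reported summary in the generalized ZSL literature is the harmonic mean of the per-class accuracies on seen and unseen classes, which penalizes models that do well on only one of the two groups. The sketch below is an illustration under that assumption, with made-up accuracy values.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen-class and unseen-class accuracy."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# Hypothetical accuracies for two models.
biased_model = harmonic_mean(0.8, 0.2)     # strong on seen, weak on unseen
balanced_model = harmonic_mean(0.5, 0.5)   # same arithmetic mean, balanced
```

Both hypothetical models have the same arithmetic mean accuracy, but the balanced one obtains the higher harmonic mean, reflecting better generalization to unseen classes.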
Zero-shot learning (ZSL) has changed how we use machine learning, allowing us to work with new categories of data that models have not seen during training. This is crucial for fields where data are scarce or complex, and ZSL is becoming a cornerstone of machine learning across many distinct fields of research. One significant advantage of zero-shot learning techniques is their ability to generalize to unseen classes, allowing models to recognize and classify objects or concepts not encountered during training. This capability is invaluable in scenarios where obtaining labeled data for all possible classes is impractical or costly. Additionally, zero-shot learning promotes model scalability by reducing the need for continuous retraining as new classes emerge, which makes it suitable for dynamic environments.
The ongoing innovations in this area are fascinating and we believe that they will most likely lead to more refined and impactful tools and solutions in the very near future. Out of the 221 papers surveyed in this review, 42 employ generative modelling, 5 leverage attention-based mechanisms in ZSL, and 57 incorporate a blend of methodologies with zero-shot learning, highlighting the adaptability and escalating significance of these models in varied knowledge domains. Such versatile approaches are instrumental in managing complex or scarcely available data types, reflecting potential dominance in fields that struggle due to data limitations.
However, it should be noted that zero-shot learning-based methodologies also face challenges that need addressing in order to reach their full potential. A critical concern is enhancing scalability and generalization, as it is difficult for most current ZSL methods to adapt to extensive datasets and to generalize to datasets that diverge substantially from the ones used for training. Handling domain shift is equally crucial; many ZSL methods struggle to capture the fine details and meanings in the visual attributes of target data points, and better models are required to represent these small differences effectively.
Furthermore, mitigating bias in cross-modal learning is imperative. The representation of data from disparate domains in a unified embedding space often leads to the acquisition of incorrect correlations and associations. Therefore, refinements in cross-modal learning methodologies are essential to ensure accurate and unbiased learning outcomes.
Another inherent challenge is the anticipation of unseen classes during training. ZSL models, although adept at recognizing unseen classes, typically do not undergo explicit training on these classes independently, potentially compromising their performance on unseen data. The refinement of training methodologies to enable models to anticipate and adapt to unseen classes more effectively is a pressing need.
Despite these challenges, the versatility of zero-shot learning is opening up new possibilities across diverse domains of knowledge. Its adaptability and potential promise applications well beyond the existing ones, thus influencing many different aspects of our lives. It is plausible that techniques inherent to ZSL could serve as catalysts for upcoming innovations, resulting in novel paradigms across diverse fields of research and enabling advancements and solutions to enduring challenges and problems.
