Abstract
Scene graph generation is a structured way of representing an image as a graph; it is mostly used to describe a scene’s objects, their attributes, and the relationships between the objects in the image. Image retrieval, video captioning, image generation, specific relationship detection, task planning, and robot action prediction are among the many visual tasks that can benefit greatly from the scene graph’s deep understanding and representation of the scene. Although many methods exist, this review considers 173 research articles concentrating on the generation of scene graphs from complex scenes and analyzes them across various scenarios and key points. The techniques employed for generating scene graphs from complex scenes are categorized as structure based, prior knowledge based, deep understanding based, and optimization based scene graph generation. The survey is organized by research technique, publication year, performance measures on the popular Visual Genome dataset, and the achievements of the methods toward accurate generation of scene graphs from complex scenes. Finally, it identifies the research gaps and limitations of the existing procedures, to inspire more advanced strategies for generating scene graphs from complex scenes.
Keywords
Introduction
Computer vision is the field of computer science that deals with image-based inputs, and it has been an active area of research over the past decades. It enables computers to see and to solve problems under a wide range of situations and constraints. It also supplies Artificial Intelligence with the means of acquiring and processing visual knowledge, extending the limits of what computers can do. Since computer vision serves as the eye of computing devices, it underlies many of the actions those devices can take. It is a vast and key area of Artificial Intelligence, but it is also expensive, both in research and in real-life use, and its many dependency factors make it highly challenging.
Deep learning [142, 91, 26] is a branch of artificial intelligence that uses multiple layers of neural networks to learn from structured and unstructured data. It is still an active and evolving field of research, with many open problems and opportunities, such as deep understanding of the visual scene and knowledge retrieval [101]. It has been applied to various domains such as natural language processing, computer vision, voice recognition, and knowledge generation. Factors that have contributed to the success of deep learning include the availability of massive datasets, the development of powerful computing hardware, and the advancement of novel algorithms and architectures. Deep learning methods have achieved remarkable results on scene graph generation by leveraging convolutional neural networks (CNNs) [49, 5, 133] and graph neural networks (GNNs) [12, 171, 131] to encode the visual features and the graph structure. However, several challenges and limitations still need to be addressed, such as the scalability of the models, the diversity and trustworthiness of the predictions, and the generalization to novel scenes.
Scene graph generation [113, 92] is a task that aims to extract semantic and structural information from an image by representing the objects and their relationships as a graph. It has been developed using deep learning techniques that leverage large-scale image-text datasets and powerful neural network architectures. In recent years, scene graph generation has achieved remarkable results in many computer vision tasks that require a high-level understanding of the image content, such as image classification [9, 19, 67, 33, 40] and image recognition [137]. In image captioning [119], scene graphs can help generate more descriptive and accurate captions by incorporating the object attributes and relationships; in visual question answering, they can help answer complex questions that involve reasoning over multiple objects and their interactions; and in image retrieval [21], they can help retrieve images that match a given query based on semantic and spatial criteria. Scene graphs can also facilitate other tasks such as knowledge representation [53], machine translation [106], and image summarization.
The scene graph depicts the relationships between the objects as well as their locations and categories. Numerous researchers have recently paid attention to inferring scene graphs because doing so exposes the rich semantic connections between the objects in the input data. In addition, the scene graph provides a context guide for image recognition tasks, and this richer semantic understanding has broad applications in high-level vision. Scene graph generation, or Visual Relationship Detection (VRD), is a significant stage in comprehending a scene of visual items. It has attracted a lot of interest in recent times because of its capacity to give an organized, complete image representation for high-level visual reasoning. Scene graph generation enables adequate perception and comprehensive understanding of scenes, particularly for 3D real-world scenes [154]. However, neither scene graph generation models nor the corresponding performance measures help us identify whether the neural network models really learn to capture the semantics of relations or only learn to fit the partial annotations in the training data [59].
Some recent methods use Transformers [155, 160, 172] to encode both visual and textual features and predict objects and relations via a masked token prediction task. Other methods use a CNN to extract visual features and a GNN to model the visual relationships between objects. Some methods also explore semantic point clouds [156, 108] as an input modality and generate 3D scene graphs [171, 6, 84, 19] that capture the spatial and semantic information of the scene. Scene graph generation remains a challenging task that requires bridging the gap between images and texts, managing open-vocabulary settings, and dealing with noisy or incomplete annotations. For image and text retrieval tasks, a scene graph provides rich contextual information that can improve the accuracy and diversity of the retrieved images and text. Retrieval algorithms can assess the semantic similarity and relevance between the two modalities by matching the scene graph nodes and edges with the words and phrases in the text. For instance, a scene graph can capture the attributes, actions, and interactions of the objects in an image, which helps distinguish between different scenes or scenarios that share similar objects. Scene graphs can also help knowledge retrieval by enabling hierarchical and compositional reasoning that can manage complex and abstract queries. The attention mechanism focuses only on the relevant part of the input data, so a model that uses it discards the data it does not need. The model gains a deep understanding of that part of the data and bases its predictions on it, which improves the training performance of scene graph generation models. Owing to recent developments in knowledge sources and differing input formats, the predictions of deep learning models are not yet sustainable or trustworthy. The construction of scene graphs still faces many challenges, such as a lack of deep understanding of the input data and of the connectivity between multiple instances.
Most importantly, different learning models, such as those drawing on prior knowledge or attention based approaches, have different problem-specific impacts on evaluation metrics such as the accuracy and completeness of the scene graphs. Accordingly, a review of these learning models helps establish what has been achieved so far on this topic.
The main intention of this survey paper is to identify the various strategies used for scene graph generation. In this review, the techniques that are currently in use are categorized as structure based, prior knowledge based, deep understanding based, and optimization based. The review is based on observations of publication year, datasets, software tools, architecture categories, and so on. Additionally, recall is the most frequently utilized performance metric for scene graph generation. The research gaps presented here give a general identification of the drawbacks of the surveyed papers, and they can serve as a resource for expanding one’s knowledge of scene graph generation. The remainder of this review is organized as follows.
Section 2 presents the categorization of different scene graph generation techniques. The categorized techniques are explained and analyzed in Section 3. An examination of scene graph generation methods and their evaluation measures is summarized in Section 4, and Section 5 covers the applications and future trends of scene graph generation. The conclusion of this survey is given in Section 6.
Scene graph generation
A scene graph is a structured semantic representation of an input image. It bridges the gap between the visual and semantic perception of a visual scene, and the major challenge is to detect and predict the relationships between the objects and subjects in complex scenes. Scene graph generation is the task of detecting and predicting the visual relationships among the objects and subjects in an input image, each of which is represented as a subject-predicate-object triplet.
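The triplet representation can be illustrated with a small data structure. The following is a minimal sketch, not taken from any surveyed method: the class names `SceneObject` and `SceneGraph`, the bounding-box format, and the example labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """A detected object: a category label plus a box (x1, y1, x2, y2)."""
    name: str
    box: tuple


@dataclass
class SceneGraph:
    """Objects are nodes; each relationship is a directed, labelled edge."""
    objects: list = field(default_factory=list)
    # Each relation is stored as (subject_index, predicate, object_index).
    relations: list = field(default_factory=list)

    def add_object(self, name, box):
        """Add a node and return its index."""
        self.objects.append(SceneObject(name, box))
        return len(self.objects) - 1

    def add_relation(self, subj, predicate, obj):
        """Add a directed edge from node `subj` to node `obj`."""
        self.relations.append((subj, predicate, obj))

    def triplets(self):
        """Return the graph as (subject, predicate, object) name triplets."""
        return [
            (self.objects[s].name, p, self.objects[o].name)
            for s, p, o in self.relations
        ]


# Hypothetical example scene: a man wearing a hat rides a horse.
graph = SceneGraph()
man = graph.add_object("man", (10, 20, 110, 300))
horse = graph.add_object("horse", (60, 120, 320, 380))
hat = graph.add_object("hat", (30, 5, 80, 40))
graph.add_relation(man, "riding", horse)
graph.add_relation(man, "wearing", hat)
print(graph.triplets())
```

Storing relations as index triples keeps the nodes and edges separate, so object attributes can be added to the nodes later without touching the relationships.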
Classification of scene graph generation (SGG) methods.
Learning an effective and accurate representation from the image is a crucial and difficult task in scene graph generation. Most recent works use a perceptual structure in a 2D representation, whereas structure based scene graph generation represents the scene graph in a 3D domain space. 3D scene graphs are knowledge graphs used in computer graphics to represent complex 3D scenes.
Zachary et al. [171] developed a reinforcement learning framework using GNN architecture to learn navigation policies. The model embeds 3D scene graphs into agent-centric feature spaces, capturing occupancy and semantic content while retaining trajectory memory. The model improves object search behaviour, long-term memory, and navigation objectives. Aayush et al. [6] developed Sim-to-Real scene graph (Sim2SG), a method for simulation to real transfer learning for scene graph generation. This method addresses the domain gap between synthetic and real data by decomposing appearance, label, and prediction discrepancies. The model shows significant improvements in both qualitative and quantitative aspects and is validated on both toy and realistic simulators.
Yun et al. [149] proposed a framework for compressing 3D scene graphs under communication constraints using graph theoretic tools, specifically graph spanners. The compression strategies are navigation-oriented and preserve the shortest paths between locations. The effectiveness of the model is demonstrated through synthetic robot navigation experiments in a realistic simulator. The authors present two algorithms, which use graph spanners to prune nodes and edges from a 3D scene graph, retaining navigation-relevant information and imposing a user-specified compressed size.
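The idea of a graph spanner can be sketched with the classic greedy construction. This is a generic textbook procedure, not the two algorithms of [149]: it prunes edges only, and it is controlled by a stretch factor rather than a user-specified compressed size. The function names, the place labels, and the edge weights are assumptions made for illustration.

```python
import heapq


def shortest_distance(adj, source, target):
    """Dijkstra distance from source to target; infinity if unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in adj.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")


def add_edge(adj, u, v, w):
    """Insert an undirected weighted edge into the adjacency dict."""
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w


def greedy_spanner(edges, stretch):
    """Keep an edge only if the spanner built so far cannot already
    connect its endpoints within `stretch` times the edge weight."""
    adj, kept = {}, []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if shortest_distance(adj, u, v) > stretch * w:
            add_edge(adj, u, v, w)
            kept.append((u, v, w))
    return kept


# Hypothetical place layer of a 3D scene graph (weights are path lengths).
places = [
    ("kitchen", "hall", 1.0),
    ("hall", "office", 1.0),
    ("kitchen", "office", 1.5),
    ("office", "lab", 2.0),
    ("hall", "lab", 2.5),
]
kept = greedy_spanner(places, stretch=2.0)
print(kept)
```

In this toy graph two of the five edges are dropped, yet every original connection can still be travelled along a path at most twice as long, which is the kind of navigation-preserving guarantee a spanner provides.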
Nathan Hughes et al. [84] present Hydra, a real-time spatial perception system that builds 3D scene graphs from sensor data. The system uses real-time algorithms for the mesh, object, and place layers, as well as a segmenting approach for rooms. It is implemented in a parallelized architecture, combining mid-level perception processes with slower high-level perceptions. The system’s loop closure detection algorithm optimizes generation. The system outperforms batch offline methods in reconstructing 3D scene graphs. Antoni R et al. [8] introduced Kimera, a system using a 3D Dynamic Scene Graph [7] to bridge the gap between robot and human perception. The system includes visual-inertial Simultaneous Localization and Mapping (SLAM) [158], an object localization module, a human pose estimation module, a metric for semantic 3D reconstruction, and scene parsing. It performs well in real-time datasets and photo-realistic simulations, including uHumans2, and can be used for real-time hierarchical semantic path planning.
Yue et al. [156] propose a task to localize 3D bounding box changes and describe scene changes. They create a simulated dataset and a framework that incorporates different 3D object detectors. The framework improves both change detection and captioning tasks. Pretraining on the proposed dataset increases change detection accuracy by
Wu et al. [120] developed SceneGraphFusion, a method for incrementally building semantic scene graphs in 3D environments using Red Green Blue-Depth (RGB-D) frames. The method aggregates PointNet features and introduces a novel attention mechanism for partial and missing graph data. It outperforms other methods and achieves better performance than other 3D semantic and panoptic segmentation methods. Iro Armeni et al. [52] developed a semi-automatic framework for constructing a 3D scene graph with unified semantics, 3D spaces, and a camera. The framework uses framing and multi-view consistency to enhance detection methods and aggregate information in 3D. The results were better than those of recent works. Sangjin K et al. [108] developed a low-power GCN processor for real-time 3D point cloud semantic segmentation on mobile devices. The GCN processor features sparse grouping based dilated graph convolution, a two-level pipeline (TLP), and center point feature reuse. It reduces computation, external memory access, and core utilization, making it suitable for mobile mixed-reality devices and low-power accelerators. Liu et al. [154] developed a 3D scene graph generation system for logical data, achieving fine-grained substance class, connection marks, and high accuracy. The system uses Graph Feature Extraction and Graph Contextual Reasoning modules, with a multi-task learning strategy [31].
Sarthak et al. [103] introduced SceneGraphGen, a deep auto-regressive model that learns a probability distribution over labeled and directed graphs using a hierarchical recurrent architecture. The model generates diverse, semantically plausible scene graphs, outperforms the Graph Recurrent Neural Network (GraphRNN), and can transfer to novel images, detect unusual graphs, and extend incomplete graphs. Wu et al. [118] developed a hierarchical context based method that extracts entity, global, and scene contexts from images to improve emotion recognition accuracy. The method was tested on various datasets, achieving an Emotion Recognition Score (ERS) of 0.2153, and addresses challenges in emotion recognition on multilabel classification datasets.
Yu et al. [64] developed a debiasing Cognition Tree (CogTree) loss for unbiased scene graph generation. The CogTree first separates relationships that are easy to tell apart and then focuses on distinguishing the remaining ones. The loss was designed for this structure; it improves model performance and enables relationships to be distinguished accurately from coarse to fine. However, the method lacks sufficient knowledge to optimize the CogTree structure. Wang et al. [134] developed a human-like Hierarchical Entity Tree (HET) for scene representation and scene graph generation. A Hybrid-Long Short-Term Memory (LSTM) was used to encode the HET’s hierarchy and context, and a Relation Ranking Module was developed to dynamically adjust the key relations of the generated scene graph. This method achieved high results in scene graph generation and image-specific relations, which are crucial for downstream tasks. Chuhao et al. [20] introduce a novel loop detection method for indoor simultaneous localization and mapping (SLAM). The method uses the geometric information of indoor scenes to generate the scene graph. It matches the scene graphs based on their topology descriptors and volume similarity to find loops in the input. The method is evaluated on a real-world dataset [85] collected in dense indoor scenes, and it outperforms the existing semantic-aided loop detection methods.
Prior knowledge based scene graph generation
Prior knowledge is knowledge gained from a single context or from multiple contexts. It can be utilized to learn more about the problem from various aspects. Prior knowledge can therefore be divided into various types, such as linguistic prior, visual prior, knowledge prior, and context prior. Scene graph generation involves relationships as combinations of objects within a wider semantic space.
Li et al. [72] introduce model-agnostic label semantic knowledge distillation (LS-KD) for unbiased scene graph generation, addressing scene graph generation datasets with multiple predicates and missing annotations. It improves performance by incorporating iterative and synchronous self-knowledge distillation strategies. Tianshui et al. [122] propose a system that incorporates statistical correlations into deep neural networks for scene graph generation. This improves performance and addresses the uneven distribution of real-world relationships. The method has been tested on the large-scale Visual Genome dataset and introduces a new evaluation metric, mean recall. Khan et al. [82] developed a method to generate expressive scene graphs using commonsense knowledge infusion for visual understanding and reasoning. The method uses a heterogeneous knowledge source and graph embedding to improve performance and expressiveness in visual understanding and reasoning tasks.
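The difference between recall and mean recall can be made concrete with a small calculation. The sketch below is a simplification under stated assumptions: it scores a single image, it matches triplets by label only (standard benchmarks also require the predicted boxes to overlap the ground-truth boxes), and it assumes a non-empty ground-truth list. The function names and the example triplets are hypothetical.

```python
from collections import defaultdict


def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


def mean_recall_at_k(ranked_predictions, ground_truth, k):
    """Average of per-predicate recalls, so that a rare predicate
    counts as much as a frequent one."""
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        hits[predicate] += int(triplet in top_k)
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Ground truth: three frequent "on" relations and one rare "riding" relation.
ground_truth = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("dog", "on", "grass"),
    ("man", "riding", "horse"),
]
# Predictions sorted by confidence; the rare predicate is ranked last.
ranked = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("dog", "on", "grass"),
    ("man", "holding", "hat"),
    ("man", "riding", "horse"),
]
r4 = recall_at_k(ranked, ground_truth, k=4)
mr4 = mean_recall_at_k(ranked, ground_truth, k=4)
print(r4, mr4)  # 0.75 0.5
```

A model that only gets the frequent predicate right still reaches a recall of 0.75 here, while its mean recall drops to 0.5, which is why mean recall is the more informative measure on long-tailed relationship distributions.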
Jiuxiang et al. [55] present a novel scene graph generation algorithm that incorporates external knowledge and an image reconstruction loss. It aims to improve generalizability in scene graph generation by incorporating commonsense knowledge from an external knowledge base and an auxiliary image reconstruction path. The knowledge-based feature refinement module refines object and subgraph features, while the image-level supervision module reconstructs images from detected objects. The proposed framework outperforms prior methods on benchmark datasets. Jaewon et al. [97] propose a method for visual relationship detection, which involves detecting objects and classifying predicates. The authors address challenges such as intra-class variance, long-tail distribution, and class overlapping. They use language and visual modules, spatial vectors, word vectors, and bounding boxes to improve performance. Their proposed spatial vector effectively detects unseen visual relationships without costly linguistic knowledge distillation or complex loss functions.
Wenbin et al. [133] propose a method for generating a topic scene graph by distilling attention from image captions. The method reduces trivial content and noise, shifting attention from individual objects to relational events. Haiyan Gao et al. [47] proposed the Balanced Award-Punishment Model (BAPM) to address long-tailed dataset biases using energy-based learning and causal reasoning. The model combines a stochastic strategy, knowledge transfer, and a lateral inhibition loss, achieving state-of-the-art performance on both quantitative and qualitative metrics. Jin Chen et al. [53] developed a scene graph generation model for unannotated videos that leverages annotated images. They aimed to infer unseen dynamic relationships [152] and adapt objects and static relationships. To reduce the data distribution disparity between image and video frames, the method combines external commonsense knowledge and hierarchical adversarial learning.
Yuyang et al. [96] present a Dual scene graph Convolutional Network (Dual-SGCN) method for predicting motivation using complex visual and semantic contexts. The method uses multi-task co-training and unbiased motivation inference. Bingqian et al. [17] discuss atom correlation based graph propagation (AG-GP) for scene graph generation to address the long-tailed distribution in datasets. It focuses on diverse atom correlations, node feature initialization, graph propagation, and visual feature refinement, improving feature enhancement and promoting comprehensive knowledge. Runqing et al. [101] present a planning system that enhances the performance of long-term manipulation tasks for autonomous robots, combining visual semantic understanding, regression planning, and multi-level representation using a knowledge base. The method improves success rates and generalization performance in simulation and real-world environments.
Zhanwen et al. [166] introduce a new approach to unbiasing scene graph generation models, called Explicit Ontological Adjustment (EOA), which uses a commonsense knowledge graph and an edge matrix for improved relationship detection. Shixing et al. [105] introduce Cross-modal processing and commonsense reasoning (CMCR), a dense video captioning model that combines visual and audio information. The model improves event localization accuracy and generates more logical captions, outperforming state-of-the-art methods [38]. Gayoung et al. [42] introduce Visual scene graph generation-Net, a deep neural network model for handling irrelevant pairs, extracting spatial-temporal context, and dealing with relationship class imbalance. The model achieves superior results on various datasets [93].
Zhang et al. [75] developed a knowledge-based model by adjusting visual contextual dependency and a relational proposal network [63], achieving better global and relevant object connections and outperforming other approaches. Zareian et al. [10] developed a graph-based neural network (GB-Net), which iteratively spreads information between and within graphs, achieving high precision. Zareian et al. [11] developed a visual commonsense model for enhanced scene understanding using affordance and intuitive physics data, improving the precision of scene graph generation. Tian et al. [159] developed a road scene graph for intelligent vehicles, utilizing topological graphs for object suggestions and relationships, which allows easy processing. Tzu-Jui et al. [127] developed a novel scene graph generation framework that uses supervised and semi-supervised relation learners to reduce biases from incomplete annotations; it captures minority classes better and is compatible with existing models. Zhenxing et al. [172] developed a novel subgraph-based object context-masked network (SOCNet) for scene graph generation that performs better on challenging datasets.
Aditya et al. [9] developed the Visio-Lingual Message Passing GNN (VL-MPAG Net) to localize objects and relationships in natural images. The approach uses three modules: Proposal Graph Generation (PGG), Structured Graph Learning (SGL), and Joint Proposal Scoring. It outperforms baselines and is compared on four public datasets. Zhang et al. [164] developed Saliency-Guided Message Passing (SMP) for visual relation saliency, enhancing scene graph structure generation and generalizability in applications such as cross-modal text retrieval and image captioning. Xu et al. [86] developed a multi-scale context modeling method for scene graph inference, combining object-centric and region-centric contexts and integrating context-fused inference.
Liu et al. [13] developed a scene-graph-guided message-passing network for dense captioning, achieving comparable results and overcoming difficulties in distinguishing relationships or ambiguous objects. Li et al. [110] developed an attentive gated graph neural network that uses message passing for VRD. The network treats edges as relationships and nodes as objects, and uses an attention mechanism to measure connectivity. The method’s efficacy was demonstrated through extensive testing on the widely used dataset, but it did not produce comparable outcomes.
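The message-passing idea shared by the methods above, with objects as nodes, relationships as edges, and attention weighting how much each neighbour contributes, can be sketched as a single update round. This is a generic illustration and not the architecture of [110] or [13]: the learned gate is replaced by a fixed 50/50 average, the attention score is a plain dot product, and the node features are made-up two-dimensional vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def message_passing_step(features, edges):
    """One round of attention-weighted message passing.

    features: dict mapping node name -> feature vector (list of floats)
    edges: list of (source, target) pairs; messages flow source -> target
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(src)

    updated = {}
    for node, feat in features.items():
        neighbours = incoming[node]
        if not neighbours:
            updated[node] = list(feat)  # nothing to aggregate
            continue
        # Attention: neighbours more similar to the node get larger weights.
        weights = softmax([dot(feat, features[n]) for n in neighbours])
        message = [
            sum(w * features[n][i] for w, n in zip(weights, neighbours))
            for i in range(len(feat))
        ]
        # Assumed update rule: fixed average instead of a learned gate.
        updated[node] = [0.5 * f + 0.5 * m for f, m in zip(feat, message)]
    return updated


# Hypothetical object features; "horse" and "hat" send messages to "man".
features = {"man": [1.0, 0.0], "horse": [0.0, 1.0], "hat": [1.0, 1.0]}
edges = [("horse", "man"), ("hat", "man")]
updated = message_passing_step(features, edges)
print(updated["man"])
```

After one round the feature of "man" has absorbed context from both neighbours, with "hat" contributing more because its feature is more similar; stacking several rounds lets context travel across longer paths in the graph.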
Deep understanding based scene graph generation
A deep understanding of scene graph generation requires not only the ability to detect and classify objects and predicates but also the ability to capture the context and reasoning behind the scenes in the input data. It is an in-depth exploration of a subject that requires higher-order thinking and critical analysis. The attention mechanism [1] has opened new perspectives across the fields of Artificial Intelligence, as it allows models to focus on the important parts of the input according to the guidance they receive.
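How attention lets a model focus on part of its input can be shown with scaled dot-product attention, the form used in Transformer models. The sketch below handles a single query in plain Python; the function name and the toy query, key, and value vectors are assumptions for illustration, and practical models add learned projections and multiple heads.

```python
import math


def scaled_dot_product_attention(query, keys, values):
    """Return (weights, output) for one query over a set of key/value pairs."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    # Softmax turns the scores into weights that sum to one.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of the value vectors.
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Toy input: the first key matches the query, the second does not.
query = [2.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, output = scaled_dot_product_attention(query, keys, values)
print(weights, output)
```

The first key receives most of the weight, so the output is dominated by the first value vector; this is the sense in which the model attends to the relevant part of the input and down-weights the rest.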
Gao et al. [68] propose a video scene graph generation framework using a transformer-based encoder-decoder structure and a role-aware cross-attention module. They introduce BIG, which uses a video grounding model and extends it to manage multiple instances of predicates with different time slots. The framework achieves superior performance on benchmarks, demonstrating the effectiveness of its temporal bipartite graph formulation. Lin et al. [143] developed a Graph Property Sensing Network (GPS-Net) for scene graph generation that explores edge direction, node priority, and long-tailed relationship distribution. It outperforms existing methods on benchmark datasets, effectively captures scene graph properties, and handles the class imbalance problem in scene graph generation. Zhiyuan et al. [169] introduce RSSGG_CS, a model for remote sensing image scene graph generation that fuses contextual information and statistical knowledge to improve feature extraction and relationship prediction. Experiments on S2SG, the first dataset for RSSGG, show that fusing contextual information suppresses object pairs without semantic relationships, reduces the search space of relationship predicates, makes the model output more consistent with real-world relationships, and facilitates commonsense reasoning.
Lu et al. [155] propose a scene graph generation model that predicts visual relationships in images sequentially and conditionally. The model uses transformer-based encoder-decoder architecture, reinforcement learning, and sequential conditioning to resolve ambiguities and reduce bias. It achieves strong generalization and robustness. Dong et al. [96] introduce a part-and-sum transformer for visual composite set detection, utilizing composite queries, part-sum interaction, and factorized self-attention layer. Part-and-sum transformer outperforms custom two-stage methods on VRD and Human-Object Interaction (HOI) detection and is generalizable to other two-level hierarchical tasks. Zhecan et al. [173] developed a scene graph based Enhanced Image-Text Learning (SGEITL) using visual scene graphs in multimodal transformers. The model includes various steps for improving performance on various datasets.
Suprosanna et al. [115] presented Relationformer, a one-stage transformer-based model for image-to-graph generation, capable of handling tasks such as pathway network extraction, graph extraction, and scene graph generation. It uses
Wang et al. [25] developed a novel scene graph generation framework using Transformer networks to convert image data into linguistic descriptions of objects and their relationships. The framework improves results on datasets and enhances the understanding of Computer Vision and Natural Language Processing (NLP) tasks. Yuren et al. [150] developed a Relation Transformer (RelTR) model for scene graph generation, predicting subject-predicate-object triplets using attention mechanisms. The model outperforms state-of-the-art methods on various datasets. Rajat et al. [99] presented a relation Transformer architecture for scene graph generation, capturing contextual dependencies and predicting relationships using complex global object interactions and a positional encoding algorithm. Pawit et al. [89] investigate scene graphs for autonomous driving and driver-action prediction using a self-supervision pipeline. The pipeline incorporates attention mechanisms to create heatmaps and enhance interpretability, and the system outperforms fully supervised approaches.
Xingning et al. [137] propose an unbiased scene graph generation method using a Stacked Hybrid-Attention (SHA) Net and a Group Collaborative Learning (GCL) strategy. The SHA Net strengthens the encoder, while GCL optimizes the decoder, mitigating biases and compensating for under-fitting. This approach significantly improves performance on various datasets. Li et al. [168] developed a Dual Attention Messaging Passing (DAMP) model for scene graph generation, addressing imbalanced information transmission. The model integrates internal and external attention mechanisms, regulating information transmission efficiency and improving scene graph generation performance, and achieves comparable results on VRD datasets. Liu et al. [12] presented a region-aware attention learning method for scene graph generation that constructs an attention space to identify the regions of objects and relationship predicates. The method incorporates object-wise and predicate-wise attention GNNs and intra-triplet and inter-triplet learning mechanisms, and it outperforms other methods on various datasets. Li et al. [90] introduced a novel multi-scale semantic fusion network (MSFN) for remote sensing scene understanding. The MSFN framework includes object detection, a Sparse Relationship Extraction Network, and a Multi-Scale Graph Convolutional Network. It effectively integrates semantic content and predicts potential relationships, outperforming other methods on the Remote Sensing scene graph dataset. Shanshan et al. [119] developed a novel image captioning model that uses multi-level alignment, semantic knowledge, and spatial relationships to achieve the best results.
Yao et al. [162] propose a novel deep learning architecture called the Multihub Driven Attention Network (MHDANet), which improves scene graph generation for digital twin tasks by focusing on valid connections, classifying objects, and predicting relationships, achieving the best performance on various datasets. Zhou et al. [51] developed a deep sparse-based graph attention network for scene graph generation, aiming for effective message passing and contextual learning. The model uses Faster R-CNN, a Relationship Measurement Network, and a Graph Attention Network (GAT) for contextual learning and object classification. Qi et al. [79] propose a novel Attentive Relational Network for scene graph generation, consisting of object detection, semantic transformation, graph self-attention, and relation inference modules. Mi et al. [73] developed a model for Visual Relationship Detection that captures object-level and triplet-level dependencies and outperforms existing methods on VRD datasets. Tian et al. [94] introduce a multi-level semantic task generation network that jointly refines the features of different levels of semantics in the input image using message-passing techniques; experiments on two datasets show that the model performs better across different vision task generations.
Hanbit et al. [48] introduce a Multi-Scale Contrastive Learning approach for complex scene generation, improving discriminative ability through locally defined pretext tasks and enhancing local representation at multiple scales. Hua et al. [139] proposed a video captioning model that combines Adversarial Reinforcement Learning and object-subject relational graph. The model extracts motion and attributes information, analyzes object motion, and connects visual content and language. It uses an Adversarial Reinforcement Learning method and multi-discriminator to learn relationships between visual content and words. The method adapts to various application scenarios and achieves satisfactory performance. Tathagat et al. [126] developed a Variational Autoencoder (VARSCENE) model to generate graphs with minimal distribution discrepancy and introduce plausible variations. The encoder embeds the graph into semantic representation vectors, while the decoder generates scene graphs by learning components of nodes and edges. It accurately mimics the underlying distribution of scene graphs in experiments.
Fu et al. [167] proposed a method that addresses the limitations of existing scene graph generation methods when the image lacks sufficient visual context. The method transforms textual information into contextualized knowledge, supported by visual items, that enriches the context. Chu et al. [87] presented TransMOT, a Multi-Object Tracking (MOT) system that uses a spatial-temporal graph transformer for robust target association in the tracking module. The system models spatial-temporal relations using sparse weighted graphs, estimates associations from loosely filtered detections, and improves MOT in complex scenes. It is evaluated on several MOT benchmarks, namely MOT15 [71], MOT16 [4], MOT17, and MOT20 [88], with the best results. TransMOT can combine the output of a generic image object detector with learnable detectors such as the Detection Transformer (DETR) [151] to form a fully end-to-end tracker.
Charulata et al. [23] introduce Multiple Attribute Detector (MAD) modules for capturing structured attribute information about objects. The modules integrate with existing scene graph generation frameworks without altering relation detection and outperform existing methods on various datasets. Wang et al. [62] propose a scene graph-driven Multi-modal Multi-granularity Multi-task learning (M3S) framework for Multi-modal Named Entity Recognition (MNER), aiming to improve the use of visual and textual information. The framework uses a novel multi-task approach, including Named Entity Segmentation and Categorization, and uses scene graphs for modeling objects and relationships. The Multi-granularity Gated Aggregation (MGA) mechanism captures inter-modal interactions and extracts critical features for named entity recognition. Zhecan et al. [172] developed Scene Graph Enhanced Image-Text Learning (SGEITL) for commonsense reasoning over images, incorporating visual scene graphs and a multi-hop graph transformer. The framework outperforms existing methods on various datasets by leveraging structural knowledge extracted from visual scene graphs.
Jingwen et al. [54] present a Multimodal Graph Inference Network (MGIN) for scene graph generation, using Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN enhances inference capability for triplets, particularly for uncommon samples. The MGIN module incorporates statistical knowledge, while TMFI combines visual and semantic features for efficient prediction. Lyu et al. [36] proposed a weakly supervised visual-textual (vtGraphNet) scene graph for complex visual grounding. The model learns the Bi-modal scene graph, attribute-assigning and relationship-referring models, and a graph consistency loss function. Validated on the dataset, it outperforms state-of-the-art methods in handling simple and complex visual grounding tasks.
Optimization based scene graph generation
Optimization based scene graph generation is among the most challenging categories, and much of the recent research in it has improved the performance of constructing scene graphs from the base input. We collected 29 papers that rely on an optimization technique, and from this list we have chosen the most recently published ones for discussion.
Qiu et al. [157] proposed SGTracker, a tracking-based approach that incorporates temporal and spatial contexts. It tracks objects and determines object and predicate labels. SGTracker outperformed existing methods in scene graph localization on the Virtual Home Action Genome (VirtualHAG) dataset, which includes per-frame consistent annotations and relationships requiring both spatial and temporal context. The experiments demonstrated the efficacy of pre-training on the proposed dataset and its potential in real-world scenarios. Swarnendu et al. [104] propose a novel Im2Graph model for scene graph generation that proceeds by extracting image captions from local regions, applying a conceptual graph generation algorithm, and optimally combining the results.
The model is constraint-based [28] and adheres to the principles of Explainable Artificial Intelligence. It can be plugged into any proposition and caption generation system and used immediately. Experiments on the dataset show that rich concept graphs can be generated without explicit graph-based supervision. Zhang et al. [65] propose an efficient scene graph generator that considers visual, spatial, and semantic features using a late fusion strategy. The work investigates the key factors affecting performance and visualizes the learned visual features for relationships. The model is also evaluated on the Open Images dataset. Lin et al. [144] proposed a subset matching network (SM-Net) that handles the occlusion problem in complex scenes. The network decomposes the scene graph into a node subset and an edge subset and jointly predicts their categories. SM-Net was evaluated on various datasets with reasonable results and showed robustness to occlusion.
Zhiyuan et al. [170] introduce a segmentation-based model for remote sensing image scene graph generation (SRSG), which uses segmentation to produce more accurate scene graphs. The model embeds morphological features and maps them to the semantic space, and the work contributes a new dataset called segmentation results to scene graphs (S2SG). The SRSG model outperforms previous methods in generating remote sensing image scene graphs. Zhuoyue et al. [173] develop a semantic description for monocular digestive endoscopy using a scene graph and a clustering algorithm, enhancing the accuracy and robustness of feature matching. He et al. [123] propose a Decomposition and Composition (DeC) method that avoids biased predictions in scene graph generation by decoupling visual features into intrinsic and relation-dependent components, improving performance on the Visual Genome (VG) and Genome Question Answering (GQA) datasets. Sangmin et al. [117] developed a new framework for scene graph generation that addresses challenges such as ambiguity, asymmetry, and higher-order contexts. It uses local interaction heads, direction-sensitive encoding, and Bi-directional Relationship Classification for better prediction.
Zero-shot learning for scene graph generation is a crucial task that aims to produce a structured representation of the objects and their relations in an image without requiring any training data for the target classes. Its results can be leveraged in applications such as image captioning [119], visual question answering [130], and image retrieval [58]. It improves the losses for novel compositions in scene graph generation [15] and the performance of scene graph generation methods in the zero-shot setting. This can be achieved by using a knowledge graph completion strategy [148] to generate the missing information for zero-shot triples and then integrating it with the visual features of images. Integrating commonsense knowledge into scene graph generation [140], particularly for zero-shot relation prediction, provides an alternative strategy. This can be accomplished by building graph mining pipelines on top of cutting-edge scene graph generation frameworks to model the neighborhoods and pathways around items in an external commonsense knowledge graph. Motoharu et al. [82] propose open-set scene graph generation for detecting unknown objects and their relationships, which extends the applicability of scene graph generation to real-world situations; they also compare existing methods and address their limitations.
Li et al. [153] introduce a Lexical Knowledge-aware Memory Network (LKMN) for zero-shot relationship prediction in scene graph parsing. The method reduces intra-class variation by distilling linguistic knowledge of different objects. Evaluation on the dataset shows that LKMN is effective in zero-shot, few-shot, and supervised settings. Guo et al. [151] developed a model for one-shot scene graph generation using Multiple Structured and Commonsense Knowledge. They used the Instance Relation Transformer encoder to explore the visual entity information. In their experiments, the model outperformed existing methods that use multiple structured knowledge. Hengyue et al. [49] present a Fully Convolutional scene graph generation model that detects and predicts objects and relations simultaneously. Using a bottom-up method, relationships are encoded as 2D vector fields called Relation Affinity Fields (RAF) and objects are encoded as bounding box center points. Extensive experiments show efficacy, efficiency, and generalizability, with competitive recall and zero-shot recall and reduced inference time.
Zhi et al. [171] propose Union Visual Translation Embedding (UVTransE) for VRD and scene graph generation, an extension of the VTransE method that incorporates contextual information and maps entities and predicates into a low-dimensional semantic space. Combined with a recurrent language model, it outperforms previous translation-based models. Nikolaos et al. [83] introduce an adaptive local-context-aware classifier based on object categories. They mine and learn predicate synonyms, apply a distillation-like loss, and evaluate the model on various datasets, where it outperforms most existing approaches.
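As a rough illustration of the translation-embedding idea behind VTransE, which UVTransE extends, the sketch below ranks candidate predicates by how closely subject + predicate lands on the object in a shared embedding space. The 2-D vectors, the predicate names, and the helper `rank_predicates` are hypothetical and chosen only for illustration; this is not the UVTransE formulation itself, which additionally draws on contextual information.

```python
import math

def translation_distance(subject, predicate, obj):
    """Euclidean distance between (subject + predicate) and the object embedding."""
    return math.sqrt(sum((s + p - o) ** 2 for s, p, o in zip(subject, predicate, obj)))

def rank_predicates(subject, obj, predicate_embeddings):
    """Return predicate names ordered from smallest to largest translation distance."""
    return sorted(
        predicate_embeddings,
        key=lambda name: translation_distance(subject, predicate_embeddings[name], obj),
    )

# Toy embeddings, made up for illustration only.
person = [0.0, 0.0]
bike = [1.0, 0.0]
predicates = {
    "riding": [1.0, 0.1],
    "under": [-1.0, 0.0],
    "next to": [0.0, 1.0],
}

ranking = rank_predicates(person, bike, predicates)
print(ranking)  # ['riding', 'next to', 'under']
```

In a trained model the embeddings would be learned from visual features rather than written by hand; the ranking step stays the same.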
Bibliographic analysis
In this section, we analyze the scene graph generation topic from several angles: year of publication, implementation tools, datasets, neural network architectures for scene graph generation, and performance measures.
Analysis based on dataset
In this part, we analyze the datasets most commonly used for the scene graph generation task. Figure 2 depicts the different datasets used for the generation of scene graphs. According to the analysis, Visual Genome (VG) [100] is the most frequently used dataset. We grouped the datasets into five major categories: 2D images, videos, 3D representations, simulated, and custom-built datasets.
Analysis on dataset.
Most of the research on VRD and scene graph generation has focused on 2D images based on their problems. Consequently, several 2D image datasets are accessible, and Table 2 summarises their statistics. Some of the most frequently utilized datasets in this investigation are listed below.
Microsoft common objects in context (MS-COCO)
MS-COCO [166] provides images that depict everyday scenes with common objects in their natural context. It has advanced the development of object segmentation, recognition, and detection. It contains a total of 2.5 million labeled instances in 328k images, and it is distinctive in that instance-level segmentation masks are annotated, enabling more precise detector evaluation. It also carries rich contextual knowledge about its 80+ object categories. The models in [153, 30, 76, 161, 173, 32, 106, 11, 21, 34] were trained on this dataset and achieved comparable results across different independent tasks.
Visual genome (VG)
The VG [100] dataset gathers detailed annotations of the objects, properties, and relationships inside each image with the aim of enabling the modeling of relationships between the objects in images. It contains more than 100k images with an average of 21 objects, 18 attributes, and 18 pairwise relationships between the objects in each image. The VG dataset is the densest and largest dataset for image captioning and visual question answering tasks. Many prior works have investigated automatic methods (such as merging and filtering) for cleaning up objects and relations in the image annotations and have created their own Visual Genome versions to reduce the noise. In this review, a total of 76 research articles use the VG dataset for training. Some of them subsample it or mix it with other datasets such as VRD, COCO, AG, GQA, and Open Images. These combinations are summarized in Table 1.
Combination of VG dataset and other dataset counts
Of these, one-shot VG [151], VG-KR and VG150 [164], VG-R10 and VG-A16 [121], VG-DR-NET and VG-MSDN [88, 138], VG-KR [69], VG200 [50], and sVG [14] have released cleaned annotated versions. Other works [139, 27, 92, 76, 118, 173, 87, 48, 62, 9, 131, 129, 125, 52, 105, 39, 21] use problem-specific splits and modifications, which prevents direct comparison with existing experiments.
Genome question answering (GQA)
GQA [30] is a new dataset that aims to measure visual understanding in computer vision and improve visual question answering. It features compositional questions over real-world images and scene graphs of their objects, attributes, and relations. Each question has a functional program that outlines the steps in the reasoning process as well as a structured description of its semantics. Additionally, the dataset offers new measures for evaluating the precision, consistency, reliability, and plausibility of model predictions. The questions are grammatical, diverse, and idiomatic, and are based on crowdsourced natural-language scene graphs. The dataset is designed to be balanced and clean and to avoid language and world priors. The works in [23, 130] train their models on this dataset and achieve comparable results across different independent tasks.
Visual relationship detection (VRD)
The VRD [22] dataset is a prominent benchmark for the scene graph generation task. It exposes the problem of the long-tailed distribution of infrequent relationships, and it contains 5k images with 6672 distinct relationship types. The relations fall broadly into categories such as action, spatial, verbal, preposition, and comparative. The works in [171, 168, 49, 58, 96, 97, 109] used this dataset and report the best results for their models.
Action genome (AG)
The AG [56] dataset represents actions as compositions of spatio-temporal scene graphs. It captures the changes in objects and their pairwise relationships across the input frames. It contains 10,000 videos with 0.4 million objects and 1.7 million relationships, and it provides frame-level scene graph labels for the components of each action. AG is the first video-based benchmark that provides both action and spatio-temporal scene graph labels. According to the analysis, the works in [48, 9, 74] used this dataset and achieved better results under different settings.
Video visual relation (VidVRD)
VidVRD [146] is a dataset that extracts instances of visual relations of interest in a video. A visual relation instance is represented by the relation triplet made up of the subject and object motions. The dataset contains 1k videos with clear visual relations, and it covers common subjects/objects of 35 categories and predicates of 132 categories.
Video object relation (VidOR)
The VidOR [145] dataset contains 10k videos gathered from the YFCC100M collection, together with many fine-grained annotations for relation understanding. Most of the objects are annotated with bounding-box trajectories to identify their spatio-temporal location in the videos. This results in the annotation of about 5K objects and 38K relation instances. According to our investigation, the works in [68, 53, 42] used the VidVRD and VidOR datasets for their problem-specific evaluation and outperformed other state-of-the-art models.
3D Datasets
3DSGG is a dataset that provides 3D semantic scene graphs for large-scale 3D reconstructions of indoor environments. From the scanned environments, semantic graphs are created whose nodes are 3D objects and whose edges are semantic connections. The dataset was first used in [61], and it was generated in a semi-automatic way. The works in [61, 120, 19, 108, 154] trained their models on this dataset and achieved better results than existing methods on tasks including retrieval.
Simulated datasets
The authors of [171, 6, 149, 84, 8, 77, 156, 129, 101] created their own datasets for their specific tasks using simulators. Because of limited knowledge and data availability, they simulated datasets to meet their requirements and published them openly for future work. Most of the simulated datasets in our analysis were built with a Unity-based simulator. Targets and obstacles are placed randomly in the environment so that the model is trained for a wide range of real-world circumstances. Samples generated in this way are comparatively cheap to produce and can be used to train high-performance models. Among these, [149, 118, 77, 156, 101] created their own environments for robot action estimation and movement prediction.
Custom-built datasets
The authors of [139, 27, 93, 76, 118, 170, 87, 48, 131, 95, 129, 125, 52, 105, 40, 21] created their own datasets according to their task-specific requirements. Some of the data are sourced from external bodies such as NGOs and organizational sites, and some were collected by the authors' own agents for their specific problems. Each article reports better results with respect to the needs of its problem.
The statistics of common scene graph datasets
NB. “–” indicates that this attribute is not released.
For this survey, 170+ papers on scene graph generation were collected across a long span of publication years. Figure 3 shows the distribution of the surveyed papers by year. One article was taken from 2013, 2 articles from the years 2015 and 2017, 7 and 8 articles from 2018 and 2019, 27 and 28 papers from 2020 and 2021, 47 articles from 2022, and 21 research papers from 2023. In 2022, additional survey studies [135, 45] on scene graph generation were published, reviewing 191 and 138 papers, respectively.
The analysis on the publication year.
This section discusses the software tools used in the reviewed work. The implementation tools utilized for efficient scene graph generation are shown in Fig. 4. In the analyzed research papers, the implementation tools are PyTorch, Python libraries, CUDA, TensorFlow, and cuDNN. Figure 4 shows that PyTorch is the most frequently used tool for scene graph generation.
Analysis based on implementation tools.
This subsection presents the various neural network architectures used for scene graph generation in the reviewed work. Figure 5 illustrates the different architectures that have been used for scene graph generation. The analysis shows that Graph Convolutional Networks (GCN) and Fast Region Convolutional Neural Networks (FRCNN) are the most widely used networks for generating scene graphs.
Category analysis.
According to the survey, we have collected and analyzed some of the standard evaluation techniques and standards for the scene graph generation task in this part. Then, we presented the quantitative findings of each advanced model on the popular VG dataset.
Methods for evaluation
Scene graph generation requires problem-specific evaluation methods to analyze the performance of the model. The analyses have been categorized into four major methods of this survey. This section describes the common evaluation tasks for scene graph generation as follows:
Predicate classification (Pred cls.) [29, 82]: Given a set of localized objects with category labels, it determines which pairs interact and classifies the predicate of each pair.
Scene Graph Classification (SG cls.) [29]: Given a set of localized objects, it predicts the object categories of the subject and object and the predicate in each pairwise connection.
Scene Graph Generation (SGGen.) [29]: It detects the objects in an image and predicts the predicate between each pair of detected objects. It resembles phrase detection, except that the bounding boxes of the subject and the object must each at least partially overlap with their respective ground truths. Because SGGen scores only complete triplets, its results cannot accurately express how well each individual element of the scene graph is detected.
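The overlap requirement in SGGen can be sketched as a triplet-matching check. The example below is a minimal illustration, assuming boxes in (x1, y1, x2, y2) form and an intersection-over-union (IoU) threshold of 0.5, which is a common convention and not a value stated in this survey; the dictionary keys and function names are hypothetical. In Pred cls. and SG cls. the boxes are given, so only the label comparison would apply.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def triplet_matches(pred, gt, iou_threshold=0.5):
    """True when all three labels agree and both boxes overlap their ground truth enough.

    iou_threshold=0.5 is an assumed, commonly used value.
    """
    labels_agree = (pred["subject"], pred["predicate"], pred["object"]) == (
        gt["subject"], gt["predicate"], gt["object"])
    return (labels_agree
            and iou(pred["subject_box"], gt["subject_box"]) >= iou_threshold
            and iou(pred["object_box"], gt["object_box"]) >= iou_threshold)

# Toy annotation and two predictions (coordinates are made up).
gt = {"subject": "person", "predicate": "riding", "object": "bike",
      "subject_box": (0, 0, 10, 10), "object_box": (5, 5, 15, 15)}
good = dict(gt, subject_box=(1, 1, 10, 10), object_box=(5, 5, 15, 16))
bad = dict(gt, subject_box=(8, 8, 18, 18))  # correct labels, poorly localized subject

print(triplet_matches(good, gt))  # True
print(triplet_matches(bad, gt))   # False
```

The second prediction has the right labels but fails on localization, which is the kind of error that only SGGen penalizes.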
Common metrics
Recall@K. [122] It measures the fraction of ground-truth relationships that are recovered among the predictions. The standard metric for evaluating scene graph generation is image-level recall, which counts the true relationships found among the top K most confident relationship predictions for an image. Some research computes R@K with a constraint that only one relationship is kept for each object pair. The unconstrained metric, which allows several predictions per object pair, evaluates the model’s reliability.
Precision@K. [153] In the VRD task, Precision@K is used to measure tagging accuracy. In the OpenImages VRD Challenge, results are evaluated using Recall@50, the mean AP of relations (mAPrel), and the mean AP of phrases (mAPphr). mAP is a strict metric because it penalizes predictions for which no ground-truth annotation exists.
Zero-Shot Recall@K [142, 97] These research articles propose zero-shot relationship learning to assess model extensibility to real-world long-tailed relationships. A single wRtr@K value can determine the zero- or few-shot performance, which is linearly aggregated for all
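As a rough illustration of the recall-based metrics described above, the sketch below computes image-level Recall@K with and without the one-relationship-per-pair constraint, and a zero-shot variant that scores only triplets absent from training. It is a simplified sketch: objects are identified by name (assuming one instance per category), box matching is omitted, and the function name and toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k, graph_constraint=True):
    """Image-level Recall@K.

    predictions: list of (score, subject, predicate, object) for one image.
    ground_truth: set of (subject, predicate, object) triplets.
    graph_constraint: keep only the top-scoring predicate per (subject, object) pair.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    if graph_constraint:
        seen_pairs, kept = set(), []
        for score, s, p, o in ranked:
            if (s, o) not in seen_pairs:
                seen_pairs.add((s, o))
                kept.append((score, s, p, o))
        ranked = kept
    top_k = {(s, p, o) for _, s, p, o in ranked[:k]}
    if not ground_truth:
        return 0.0
    return len(top_k & ground_truth) / len(ground_truth)

# Toy data, made up for illustration.
predictions = [
    (0.9, "person", "riding", "bike"),
    (0.8, "person", "on", "bike"),
    (0.7, "bike", "on", "road"),
    (0.4, "person", "wearing", "helmet"),
]
ground_truth = {
    ("person", "riding", "bike"),
    ("bike", "on", "road"),
    ("person", "wearing", "helmet"),
}

constrained = recall_at_k(predictions, ground_truth, k=2)                            # 2/3
unconstrained = recall_at_k(predictions, ground_truth, k=2, graph_constraint=False)  # 1/3

# Zero-shot recall: evaluate only triplets never seen during training.
training_triplets = {("person", "riding", "bike"), ("bike", "on", "road")}
unseen = {t for t in ground_truth if t not in training_triplets}
zero_shot = recall_at_k(predictions, unseen, k=3)                                    # 1.0
```

With K = 2 the constrained variant scores higher here because the duplicate prediction for the (person, bike) pair no longer occupies a slot; with a larger K the unconstrained variant can only match or exceed it on the same predictions.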
Quantitative performance
We present the quantitative result comparison of the scene graph generation methods with graph constraints on the VG [100] dataset. “–” indicates that there is no evaluation of that particular parameter. Some of the methods employ multiple methodology approaches, problem-specific networks, or priors; these are grouped based on the most prominent methodology approach they employ. Tables 3–6 show the evaluation using the range of recall for prior knowledge based, structure based, deep understanding based, and performance based scene graph generation, respectively.
An evaluation using the range of recall for prior knowledge based scene graph generation
An evaluation using the range of recall for structure based scene graph generation
An evaluation using the range of recall for deep understanding based scene graph generation
An evaluation using the range of recall for performance based scene graph generation
Applications of scene graph generation
Generative scene graphs are a way of representing the spatial and semantic relationships between objects in a scene. They can be used for various tasks such as image synthesis, scene understanding, and image captioning. A generative scene graph consists of nodes that represent objects and edges that represent relations. For example, a scene graph for an image of a person riding a bike on a road could have nodes for the person, bike, road, and sky, and edges for riding, on, and above. To generate a scene graph from an image, one can use a neural network that predicts the objects and their attributes, as well as the relations and their types. To generate an image from a scene graph, one can use a Graph Neural Network that takes the scene graph as input and outputs an image that matches the given description. Such generation methods could improve image search engines and, according to user requirements, produce images for model training. They could reduce the time spent searching on the internet and make image retrieval simpler and easier for the systems. Based on our analysis, Fig. 6 shows the generation of different data by a single centralized scene graph generator, and we have grouped the scene graph generation methods by the type of data they generate: text, image, and video.
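The person-riding-a-bike example above can be written down as a small data structure. The sketch below is one possible representation, not a format used by any surveyed method; the class and method names are hypothetical, and the endpoints chosen for the "on" and "above" edges are assumptions, since the text names the edges without their endpoints.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # object name -> list of attributes
    relations: list = field(default_factory=list)  # (subject, predicate, object) triplets

    def add_object(self, name, attributes=()):
        self.objects[name] = list(attributes)

    def add_relation(self, subject, predicate, obj):
        # An edge may only connect nodes that already exist in the graph.
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be added as objects first")
        self.relations.append((subject, predicate, obj))

    def relations_of(self, name):
        """All triplets in which the named object is the subject."""
        return [r for r in self.relations if r[0] == name]

graph = SceneGraph()
for name in ("person", "bike", "road", "sky"):
    graph.add_object(name)
graph.add_relation("person", "riding", "bike")
graph.add_relation("bike", "on", "road")    # assumed endpoints
graph.add_relation("sky", "above", "road")  # assumed endpoints

print(graph.relations_of("person"))  # [('person', 'riding', 'bike')]
```

A scene graph generator would fill such a structure from detector outputs, while an image or text generator would read it as its conditioning input.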
Text generation by scene graph
Scene graph based text generation is a state-of-the-art approach that can generate natural language texts from keywords, prompts, queries, or scene graphs. It uses a large-scale neural network model that is trained on a diverse corpus of texts from various domains and languages. It can produce texts with distinct characteristics such as context, tone, length, format, and style by adjusting its parameters and hyperparameters. It can also offer users suggestions for writing and for improving the quality of their creations in context, helping them to be more productive, creative, and informative. Many everyday applications already provide this kind of text recommendation or generation based on the user’s keywords. Our survey identified the following research articles on text generation.
Example scene graphs data generations.
Lawrence et al. [27] developed a method to learn visual features for semantic phrases from sentences using a Conditional Random Field (CRF) [112, 24, 132] formulation. The model extracts predicate tuples containing nouns and relations and determines the CRF’s potentials using the extracted sentences. CRF is a statistical modeling method that combines classification and graphical modeling, leveraging multivariate data and input features for prediction. The method generates scenes based on semantic meaning, making it a significant step forward in computer vision. CRF is also used as a metric to score a set of scenes for a text-based image retrieval task [116]. Sahand et al. [114] present a scene graph classification framework trained on annotated images and symbolic data. The framework classifies objects and their relations using a text-to-graph module. The model adjusts the classification pipeline with text knowledge, generating more precise results in scene graph classification, object classification, and predicate classification. Maximilian et al. [78] introduce a model that uses detected objects and auto-generated visual relationships to describe images in natural language. The model recognizes individual components and their visual relationships, producing a scene graph from raw image pixels. The final caption is produced by a graph-to-text model that takes the scene graph as input. The system has two parts: a scene graph generation module, which generates the scene graphs, and a graph-to-text module, which uses an attention-based LSTM decoder. The model’s superiority over conventional image captioning approaches is demonstrated on the newly generated dataset.
Video generation by scene graph generation is a new technique that creates realistic, diverse videos from image and text keywords. It guides video generation by scene structure and context, resulting in more coherent, consistent, and natural videos. This technique has potential applications in entertainment, education, and security, where it can support learning and teaching as well as surveillance and forensics.
Shengyu et al. [102] introduced the Dynamic Scene Graph Detection Transformer (DSG-DETR) method for generating dynamic scene graphs from videos. The method captures long-term temporal dependencies between objects and their relationships by using transformers and modelling relationship transitions. Experimental results show that DSG-DETR outperforms the existing methods on the AG dataset. Prateksha et al. [95] present the Recipe2Video model, which converts documents into multimodal illustrative videos, improving the user consumption experience. The model uses re-ranking and retrieval methods to select the best images for the recipes. They incorporated a Viterbi-based optimization algorithm [46] to create videos with visual cues, text, and voice-overs. The model captures semantic and sequential information and optimizes for videos with seamless transitions. Nag et al. [111] introduce the TEMPURA framework for generating balanced scene graphs from videos. They address challenges such as long-tailed distributions, noisy annotations, and temporal fluctuations. They place a Mixture Density Network on top of the neural network to model the predictive uncertainty that arises from noise in the data. They also introduce a memory-guided training strategy to debias the predicate embeddings. Chen et al. [70] developed a new method for spatio-temporal scene graph generation (STSGG) using a Slow-Fast Local-Aware Attention network. It addresses issues such as the inability to distinguish between dynamic and static relations and inaccurate classification of tail predicates. The method achieves state-of-the-art results on the AG and ImageNet Video datasets, enhancing feature discrimination, with potential applications in computer vision, robotics, and Artificial Intelligence.
Image generation by scene graph generation
Image generation using scene graphs is a challenging task that aims to create realistic and diverse images from structured representations of objects and their relationships. The main challenge is preserving consistency between the input graph and the output image, respecting attributes, spatial arrangements, and complex interactions. There are two main methods: direct and indirect. Both methods have advantages and disadvantages. Image generation is useful for various applications, including computer vision, computer graphics, Natural Language Processing (NLP), and Augmented Reality (AR)/Virtual Reality (VR), where scene graphs provide a high-level understanding of visual content and enable tasks like image retrieval, scene parsing, Visual Question Answering (VQA), and computer graphics.
Liu et al. [35, 34] present SceneSketcher, a GCN-based architecture for fine-grained scene-level sketch-based image retrieval (SBIR). It generates realistic scenes from simple sketches, adds colors, textures, and lighting effects, and fuses multi-modality information between query sketches and target images. The model is trained using triplet and end-to-end methods, and its flexible graph feature learning allows for generalization to different scene data. SceneSketcher is suited to artists, designers, students, and anyone else doing creative work. Rishi et al. [98] introduce a new scene expansion task using an auto-regressive model called the Graph Expansion Model for Scenes (GEMS). GEMS generates hierarchically dependent nodes and edges and introduces a cluster-aware Breadth First Search (BFS) method for object co-occurrence. Experiments show that GEMS outperforms GraphRNN-based models on graph synthesis tasks, providing creative photographers with recommendations for diverse, rich scenes built around desired seed concepts. Vivek et al. [131] introduce CNN2GNN and CNN2Transformer, two methods for image classification using inter-example information. They generate a latent-space bipartite graph using GNNs to calculate cross-attention scores between input images and a proxy set. Proxy sets contrastively learn class-level global information and are incorporated into feature representations. The CNN2GNN method improves image classification performance, allowing graph construction from arbitrary datasets and using proxies for class-level global information. This approach is useful in various applications, including object recognition, scene understanding, and image retrieval. Umair Hassan et al. [81] compare scene graph-based and layout-based image generation models, finding that layout-based models generate more realistic, detailed images that better capture spatial relationships and interactions.
Image generation from scene graphs and layouts has a wide range of real-time applications. Some of these include:
Content Creation: These models can be used to generate images for use in advertising, marketing, and other creative industries.
Virtual and Augmented Reality: Creating a realistic virtual environment based on the user’s perceptions and needs.
Gaming: These models can be used to generate realistic game environments and characters.
Education: Creating educational materials such as diagrams and illustrations.
Gaurav et al. [43] developed a method that generates images incrementally from graphs of scene descriptions, preserving context and producing consistent images over time. The approach generates high-quality real-world scenes with multiple objects, with impact on fields such as robotics, Artificial Intelligence, design, and image retrieval. It requires no intermediate supervision and is applicable to real-world images. Azade et al. [2] introduce a meta-learning approach that adapts a model to different scenes and improves image quality across a diverse variety of tasks. Their experimental results report the model's performance in terms of image quality and semantic relationships. Chenyang et al. [21] present a CNN that learns image-to-graph translation without external supervision. The self-supervised approach encodes graph nodes and edges, offering benefits for intelligent agents in scene understanding and high-level reasoning. Kim et al. [109] propose a semantic scene graph generation method using the Resource Description Framework (RDF) model and deep learning techniques. The approach clarifies semantic relations between objects in images and enables efficient search and classification algorithms by adding meaningful information about image content as nodes and edges of a graph. Justin et al. [57, 58] present a flexible, semantic-based method for retrieving images using scene graphs, improving object localization and accuracy in applications including digital asset management and visual search engines. Aiswarya et al. [121] propose a framework that generates the scene graph from images using depth and spatial information of the object pairs. The framework predicts the relations and attributes of object pairs directly from images, without using any external knowledge or text descriptions.
The major application benefits of this model are:
Improved image understanding – scene graphs provide a structured semantic representation of the objects, relationships, and attributes in an image, and the explicit connections between objects support a deeper understanding of the image.
Enhanced image retrieval – scene graphs can be used to retrieve images effectively based on constrained learning [28] or guided content. This supports more reliable prediction and recommendation, so that users get appropriate results.
Improved image captioning – with a deeper understanding and a structured semantic representation of the entities, Artificial Intelligence models can describe the image more efficiently and in greater depth.
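The representation behind these benefits can be illustrated with a minimal sketch. It is not the implementation of any surveyed method: the `SceneGraph` class, the `retrieve` function, the exact-match scoring rule, and the two-image gallery are all hypothetical and chosen only to show how objects, attributes, and relationship triples can be stored and compared for retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: objects are nodes, relationships are directed edges."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)  # object -> set of attributes
    relations: set = field(default_factory=set)     # (subject, predicate, object)

    def add_object(self, name, *attrs):
        self.objects.add(name)
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subject, predicate, obj):
        # Make sure both endpoints exist as nodes before adding the edge.
        self.add_object(subject)
        self.add_object(obj)
        self.relations.add((subject, predicate, obj))


def retrieve(query, gallery):
    """Rank image ids by the fraction of query triples found in each graph."""
    scores = {}
    for image_id, graph in gallery.items():
        matched = len(query.relations & graph.relations)
        scores[image_id] = matched / max(len(query.relations), 1)
    return sorted(scores, key=lambda image_id: scores[image_id], reverse=True)


# Hypothetical two-image gallery (assumed data, for illustration only).
beach = SceneGraph()
beach.add_object("man", "tall")
beach.add_relation("man", "riding", "surfboard")
beach.add_relation("surfboard", "on", "wave")

street = SceneGraph()
street.add_relation("man", "riding", "bicycle")
street.add_relation("bicycle", "on", "road")

query = SceneGraph()
query.add_relation("man", "riding", "surfboard")

ranking = retrieve(query, {"beach.jpg": beach, "street.jpg": street})
```

Exact triple matching is the simplest possible scoring rule. The retrieval methods discussed above learn graph features instead, so this sketch only conveys the shape of the data, not their matching procedure.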
Overall, the applications and use cases of scene graph generation methods have advanced steadily in recent real-time applications such as task planning and action prediction for robots, topological land scanning for unknown space estimation, conceptual reminders for many local tasks, and point cloud estimation. This matters because current semantic or estimation-based prediction models are insufficient for many non-human tasks and can make wrong predictions. For example, many advanced economic risk prediction models have recently failed due to unforeseen factors such as the coronavirus, which affected every industry in the world. In the future, scene graph techniques could improve performance and open new possibilities in every field that involves Artificial Intelligence; automation, product development, and bio-medical industries could use them to make production more effective and efficient over a sustained period.
The major focus of scene graph generation is to extract the relationships among the entities in the input data and represent them in a graph structure. Scene graph generation is currently studied from multiple angles, such as optimizing existing techniques and applying them to problems in other domains. However, many directions still deserve attention before the technique can be trusted for predictions over long time periods. Based on the analysis and the gaps in existing research, it can play a significant role in domains such as:
Deep image understanding tasks – image generation based on users’ inputs, image captioning, image retrieval, visual reasoning, and dynamic generation of image structures from construction sites.
Dynamic action based tasks – robot navigation prediction, dynamic task planning for modern robots, and dynamic action estimation.
3D virtual environment creation – environments for robot training [37] and simulation tasks such as automatic evaluation of products and task completion by robots.
Knowledge and drug discovery – graph formation and relationship prediction store every unit of input gathered from the sources, which enables the evaluation of unknown places and action prediction based on the gained knowledge.
Social relationship detection – detecting human-object and human-human interactions is crucial for scene graphs, and these relationships can be extended to detect social relationships. This research direction aims to understand scenes more deeply, and scene graph generation models can mine unseen social relationships from large-scale datasets, offering practical applications.
Multi-modal ability – handling multiple tasks that take different types of data sources to generate common entity graphs can significantly balance the distribution and challenges of the data types. Pei et al. [60] developed a cross-modality attention framework that can match text with the scene inferences of images. This technique will enable the centralized formation of knowledge for organizations.
Modern methods for scene graph generation – mainstream scene graph generation methods rely on object classification, detection, and recognition. However, current scene graph datasets and relationship prediction models face limitations.
To improve prediction abilities, online learning, reinforcement learning, active learning, language model integration, and explainability of predictions could be introduced into future scene graph generation methods. Such strategies would make predictions more trustworthy and responsible when the models are deployed in the real world.
Conclusion
The study of scene graphs is expanding quickly, and there are many potential applications. It aims to enhance comprehension of and reasoning about more complex visual scenes. Current research, however, needs further development and investigation because it is not yet accurate enough, meaning that it needs more language and context knowledge. In this analysis, many scene graph generation techniques are considered. More than 170 research papers are gathered in this survey and categorized according to four approaches: structure based scene graph generation, prior knowledge based scene graph generation, deep understanding based scene graph generation, and optimization based scene graph generation. A variety of resources were used to compile these papers, and the difficulties encountered by current research are described and evaluated. This survey explains the motivation for studying the generation of scene graphs and helps analysts develop new techniques by addressing the noted drawbacks. An assessment is also presented based on the year of publication, toolset analysis, analysis of architectures for scene graph generation, dataset-based analysis, and performance evaluation. Among the surveyed works, optimization based methods are the most frequently used for improving the quality of scene graph generation. Similarly, PyTorch is the most frequently used tool for implementing scene graph generation, and the most commonly used dataset is the Visual Genome (VG) dataset. In addition, researchers widely use recall as a performance metric. Future work will focus on resolving the imbalance problem of the training data and on extending the technique to problems in multiple domains.
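Since recall is noted above as the most widely used performance metric, a minimal sketch of a triple-level Recall@K may help make it concrete. This is a simplified illustration under stated assumptions: the function name `recall_at_k`, the input format, and the sample predictions are hypothetical, and the sketch matches triples by label only, whereas evaluation protocols on Visual Genome typically also require predicted bounding boxes to overlap the ground-truth boxes.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the k top-scoring predictions.

    predictions: list of (score, (subject, predicate, object)) for one image.
    ground_truth: set of (subject, predicate, object) triples for that image.
    """
    if not ground_truth:
        return 0.0
    # Keep only the k predictions with the highest confidence scores.
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    top_triples = {triple for _, triple in top_k}
    return len(top_triples & ground_truth) / len(ground_truth)


# Hypothetical model output for a single image (assumed data).
predictions = [
    (0.9, ("man", "riding", "horse")),
    (0.8, ("man", "wearing", "hat")),
    (0.4, ("horse", "on", "grass")),
    (0.2, ("hat", "on", "horse")),
]
ground_truth = {
    ("man", "riding", "horse"),
    ("horse", "on", "grass"),
}

r2 = recall_at_k(predictions, ground_truth, k=2)  # 1 of 2 triples recovered
r3 = recall_at_k(predictions, ground_truth, k=3)  # 2 of 2 triples recovered
```

In practice the per-image values are averaged over the test set, and K is usually set to values such as 20, 50, or 100 rather than the small values used here.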