Review on scene graph generation methods

Abstract

A scene graph is a structured representation of an image as a graph; it describes the objects in a scene, their attributes, and the relationships between them. Image retrieval, video captioning, image generation, specific relationship detection, task planning, and robot action prediction are among the many visual tasks that can benefit greatly from the deep understanding and representation of a scene that a scene graph provides. Although many methods exist, this review considers 173 research articles that concentrate on generating scene graphs from complex scenes and analyzes them across various scenarios and key points. The techniques employed for generating scene graphs from complex scenes are categorized as structure based scene graph generation, prior knowledge based scene graph generation, deep understanding based scene graph generation, and optimization based scene graph generation. The survey is organized around the research techniques, publication year, performance measures on the popular Visual Genome dataset, and the achievements of the research methodologies toward accurate scene graph generation from complex scenes. Finally, it identifies the research gaps and limitations of existing procedures in order to motivate more advanced strategies for generating scene graphs from complex scenes.

Keywords

Scene graph generation, graph neural network, visual relationship detection, graph convolutional network

1. Introduction

Computer vision is the field of computer science that deals with image-related inputs, and it has shown steady research progress over the past decades. It enables computers to see and to solve problems under a wide range of situations and constraints, and it extends the knowledge acquisition and processing capabilities of Artificial Intelligence to a broader scope of possibilities. Because computer vision acts as the eye of computing devices, it underpins many of the actions those devices can take. It is a vast and key area of Artificial Intelligence, but it is also expensive, both in research and in real-life use, and its many dependency factors make it highly challenging.

Deep learning [142, 91, 26] is a branch of artificial intelligence that uses multi-layer neural networks to learn from structured and unstructured data. It is still an active and evolving field of research, with many open problems and opportunities such as deep understanding of visual scenes and knowledge retrieval [101]. It has been applied to various domains, including natural language processing, computer vision, voice recognition, and knowledge generation. Factors that have contributed to the success of deep learning include the availability of massive datasets, the development of powerful computing hardware, and the advancement of novel algorithms and architectures. For visual scene understanding, deep learning methods have achieved remarkable results by leveraging convolutional neural networks (CNNs) [49, 5, 133] and graph neural networks (GNNs) [12, 171, 131] to encode the visual features and the graph structure. However, several challenges and limitations remain, such as the scalability of the models, the diversity and trustworthiness of the predictions, and the generalization to novel scenes.

Scene graph generation [113, 92] is a task that aims to extract semantic and structural information from an image by representing the objects and their relationships as a graph. It has been developed using deep learning techniques that leverage large-scale image-text datasets and powerful neural network architectures. In recent years, scene graph generation has achieved remarkable results in many computer vision tasks that require a high-level understanding of image content, such as image classification [9, 19, 67, 33, 40], image captioning [119], and image recognition [137]. In image captioning, scene graphs can help generate more descriptive and accurate captions by incorporating object attributes and relationships. In visual question answering, scene graphs can help answer complex questions that involve reasoning over multiple objects and their interactions. In image retrieval [21], scene graphs can help retrieve images that match a given query based on semantic and spatial criteria. Scene graphs can also facilitate other tasks such as knowledge representation [53], machine translation [106], and image summarization.

The scene graph depicts the relationships between objects as well as their locations and categories. Many researchers have recently turned to inferring scene graphs because they expose the rich semantic connections between the objects in the input data. In addition, a scene graph provides contextual guidance for image recognition tasks, and this richer semantic understanding has broad high-level vision applications. Scene graph generation, or Visual Relationship Detection (VRD), is a significant stage in the comprehensive understanding of a visual scene. It has attracted a lot of interest recently because it provides a structured, complete image representation for high-level visual reasoning. Scene graph generation enables adequate perception and comprehensive understanding of scenes, particularly 3D real-world scenes [154]. However, neither scene graph generation models nor the corresponding performance measures help us identify whether the neural network models really learn to capture the semantics of relations or only learn to fit the partial annotations in the training data [59].

Some recent methods use Transformers [155, 160, 172] to encode both visual and textual features and predict objects and relations via a masked token prediction task. Other methods use a CNN to extract visual features and a GNN to model the visual relationships between objects. Some methods also explore semantic point clouds [156, 108] as an input modality and generate 3D scene graphs [171, 6, 84, 19] that capture the spatial and semantic information of the scene. Scene graph generation is still a challenging task that requires bridging the gap between images and text, managing open-vocabulary settings, and dealing with noisy or incomplete annotations. For image and text retrieval, a scene graph provides rich contextual information that can improve the accuracy and diversity of the retrieved images and text. Retrieval algorithms can assess the semantic similarity and relevance between the two modalities by matching the scene graph nodes and edges with the words and phrases in the text. For instance, a scene graph can capture the attributes, actions, and interactions of the objects in an image, which helps distinguish between different scenes that share similar objects. Scene graphs can also help knowledge retrieval by enabling hierarchical and compositional reasoning that can manage complex and abstract queries. The attention mechanism focuses only on the relevant parts of the input data and suppresses the rest; the model gains a deeper understanding of those parts and bases its predictions on them, which improves the training of scene graph generation models. Given the recent growth in knowledge sources and the variety of input formats, the predictions of deep learning models are not always sustainable or trustworthy. The construction of scene graphs still faces many challenges, such as a lack of deep understanding of the input data and of the connectivity between multiple instances.
Different learning models have different problem-specific impacts on evaluation metrics such as the accuracy and completeness of the scene graphs, depending on whether they rely on prior knowledge or on attention based approaches. Accordingly, a review of these learning models helps establish what has been achieved so far on this topic.

The main intention of this survey is to identify the various strategies used for scene graph generation. In this review, the techniques currently in use are categorized as structure based, prior knowledge based, deep understanding based, and optimization based. The review examines the year of publication, the datasets, the software tools used, the architecture categories, and related aspects. Recall is the most frequently used performance metric for scene graph generation. The research gaps presented here give a general picture of the drawbacks of the surveyed papers and serve as a resource for expanding one’s knowledge of scene graph generation. The rest of this review is organized as follows.

Section 2 presents the categorization of the different scene graph generation techniques. The categorized techniques are explained and analyzed in Section 3. Section 4 summarizes an examination of scene graph generation methods and their evaluation measures, and Section 5 covers the applications and future trends of scene graph generation. Section 6 concludes the survey.

2. Scene graph generation

A scene graph is a structured semantic representation of an input image. It bridges the gap between the visual and the semantic perception of a scene, and the major challenge is to detect and predict the relationships between subjects and objects in complex scenes. Scene graph generation is the task of detecting and predicting the visual relationships among the objects in an input image, each represented as a triplet $\langle \text{subject}, \text{relation}, \text{object} \rangle$, or $\langle s, r, o \rangle$. This paper reviews and analyzes the different methods implemented for scene graph generation, which are divided into four major categories: structure based, prior knowledge based, deep understanding based, and optimization based. Figure 1 shows the classification of scene graph generation methods.

Figure 1. Classification of scene graph generation (SGG) methods.
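To make the triplet notation concrete, the following minimal Python sketch stores a scene graph as labelled nodes plus directed, labelled edges. The class names `Triplet` and `SceneGraph`, their methods, and the example objects are hypothetical and chosen only for illustration; they do not come from any of the surveyed methods.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One <subject, relation, object> edge; subject and obj are node indices."""
    subject: int
    relation: str
    obj: int


@dataclass
class SceneGraph:
    """Minimal scene graph: object labels as nodes, triplets as directed edges."""
    labels: list = field(default_factory=list)    # node index -> object category
    triplets: list = field(default_factory=list)  # list of Triplet

    def add_object(self, label):
        """Add a node and return its index."""
        self.labels.append(label)
        return len(self.labels) - 1

    def relate(self, subject, relation, obj):
        """Add a directed edge from the subject node to the object node."""
        self.triplets.append(Triplet(subject, relation, obj))

    def as_text(self):
        """Render every edge in the <s, r, o> form used in the text."""
        return [
            f"<{self.labels[t.subject]}, {t.relation}, {self.labels[t.obj]}>"
            for t in self.triplets
        ]


# Illustrative scene (assumed example, not taken from a dataset).
graph = SceneGraph()
man = graph.add_object("man")
horse = graph.add_object("horse")
hat = graph.add_object("hat")
graph.relate(man, "riding", horse)
graph.relate(man, "wearing", hat)
print(graph.as_text())
```

In a full pipeline the node labels and relations would be predicted by a detector and a relation classifier, possibly with attributes and bounding boxes attached to each node; here they are entered by hand.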

2.1 Structure based scene graph generation

Learning an effective and accurate representation from the image is a crucial and difficult step in scene graph generation. Most recent works use a 2D perceptual structure, whereas structure based scene graph generation represents the scene graph in a 3D domain space. 3D scene graphs are knowledge graphs used in computer graphics to represent complex 3D scenes.

Zachary et al. [171] developed a reinforcement learning framework using GNN architecture to learn navigation policies. The model embeds 3D scene graphs into agent-centric feature spaces, capturing occupancy and semantic content while retaining trajectory memory. The model improves object search behaviour, long-term memory, and navigation objectives. Aayush et al. [6] developed Sim-to-Real scene graph (Sim2SG), a method for simulation to real transfer learning for scene graph generation. This method addresses the domain gap between synthetic and real data by decomposing appearance, label, and prediction discrepancies. The model shows significant improvements in both qualitative and quantitative aspects and is validated on both toy and realistic simulators.

Yun et al. [149] proposed a framework for compressing 3D scene graphs under communication constraints using graph theoretic tools, specifically graph spanners. The compression strategies are navigation-oriented and preserve the shortest paths between locations. The effectiveness of the model is demonstrated through synthetic robot navigation experiments in a realistic simulator. The authors present two algorithms, which use graph spanners to prune nodes and edges from a 3D scene graph, retaining navigation-relevant information and imposing a user-specified compressed size.
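The idea of a graph spanner can be illustrated with the classic greedy construction, which keeps an edge only when the spanner built so far cannot already connect its endpoints within a chosen stretch factor. The sketch below is a textbook greedy spanner on an undirected weighted graph, not the two algorithms of [149] (it does not prune nodes or enforce a size budget); the function names, the place names, and the edge weights are assumptions made for the example.

```python
import heapq


def shortest_distance(adj, source, target):
    """Dijkstra on an adjacency dict {node: {neighbour: weight}}; inf if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in adj.get(node, {}).items():
            cand = dist + weight
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return float("inf")


def greedy_spanner(edges, stretch):
    """Keep an edge (u, v, w) only if the current spanner distance exceeds stretch * w."""
    adj = {}
    kept = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if shortest_distance(adj, u, v) > stretch * w:
            adj.setdefault(u, {})[v] = w
            adj.setdefault(v, {})[u] = w
            kept.append((u, v, w))
    return kept, adj


# Hypothetical "places" layer of a 3D scene graph with traversal costs.
edges = [
    ("kitchen", "hall", 1.0),
    ("hall", "office", 1.0),
    ("kitchen", "office", 1.5),
    ("office", "lab", 1.0),
    ("hall", "lab", 2.5),
]
kept, adj = greedy_spanner(edges, stretch=2.0)
print(len(edges), "edges reduced to", len(kept))
print(shortest_distance(adj, "kitchen", "office"))
```

With a stretch of 2.0, every shortest path in the compressed graph is at most twice as long as in the original graph, which is the sense in which navigation-relevant structure is retained while edges are dropped.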

Nathan Hughes et al. [84] present Hydra, a real-time spatial perception system that builds 3D scene graphs from sensor data. The system uses real-time algorithms for the mesh, object, and place layers, as well as a segmentation approach for rooms. It is implemented in a parallelized architecture that combines mid-level perception processes with slower high-level perception processes. The system’s loop closure detection algorithm optimizes generation. The system outperforms batch offline methods in reconstructing 3D scene graphs. Antoni R et al. [8] introduced Kimera, a system that uses a 3D Dynamic Scene Graph [7] to bridge the gap between robot and human perception. The system includes visual-inertial Simultaneous Localization and Mapping (SLAM) [158], an object localization module, a human pose estimation module, a metric for semantic 3D reconstruction, and scene parsing. It performs well on real-time datasets and in photo-realistic simulations, including uHumans2, and can be used for real-time hierarchical semantic path planning.

Yue et al. [156] propose a task that localizes changes in 3D bounding boxes and describes scene changes. They create a simulated dataset and a framework that incorporates different 3D object detectors. The framework improves both change detection and captioning tasks. Pretraining on the proposed dataset increases change detection accuracy by 12.8% when applied to real-world data. Ue-Hwan et al. [129] develop a 3D scene graph for intelligent agents, using a 3D scene graph construction framework to represent physical environments and enhancing usability and scalability in real-time applications such as Visual Question Answering and Task Planning [166]. Johanna W et al. [78] proposed a method for learning 3D semantic scene graphs from indoor reconstructions, using PointNet and graph convolutional network (GCN) architectures. The method uses a semi-automatically generated dataset for domain-agnostic retrieval tasks, aiming for dense graphs with labeled instances, semantic relationships, and attributes.

Wu et al. [120] developed SceneGraphFusion, a method for incrementally building semantic scene graphs in 3D environments from RGB-D (Red Green Blue-Depth) frames. The method aggregates PointNet features and introduces a novel attention mechanism for partial and missing graph data. It outperforms other methods and achieves better performance with other 3D semantic and panoptic segmentation methods. Iro Ameni et al. [52] developed a semi-automatic framework for constructing a 3D scene graph with unified semantics, 3D spaces, and a camera. The framework uses framing and multi-view consistency to enhance detection methods and aggregate information in 3D. The results were better than those of recent works. Sangjin K et al. [108] developed a low-power GCN processor for real-time 3D point cloud semantic segmentation on mobile devices. The GCN processor features sparse grouping based dilated graph convolution, a two-level pipeline (TLP), and center point feature reuse. It reduces computation, external memory access, and core utilization, making it suitable for mobile mixed-reality devices and low-power accelerators. Liu et al. [154] developed a 3D scene graph generation system for logical data that achieves fine-grained entity class and relation labels with high accuracy. The system uses Graph Feature Extraction and Graph Contextual Reasoning modules, with a multi-task learning strategy [31].

Sarthak et al. [103] introduced SceneGraphGen, a deep auto-regressive model that learns a probability distribution over labeled and directed graphs using a hierarchical recurrent architecture. The model generates diverse, semantically plausible scene graphs, outperforms the Graph Recurrent Neural Network (RNN), and can transfer to novel images, detect unusual graphs, and extend incomplete graphs. Wu et al. [118] developed a hierarchical context based method that extracts entity, global, and scene contexts from images to improve emotion recognition accuracy. The method was tested on various datasets, achieving an Emotion Recognition Score (ERS) of 0.2153, and addresses challenges in emotion recognition on multilabel classification datasets.

Yu et al. [64] developed a debiasing Cognition Tree (CogTree) loss for unbiased scene graph generation. It focuses on relationships that are easy to understand and on identifying distinct ones. The loss is designed for this tree structure, improving model performance and enabling relationships to be distinguished from coarse to fine. However, the method lacks sufficient knowledge to optimize the CogTree structure. Wang et al. [134] developed a human-like Hierarchical Entity Tree (HET) for scene representation and scene graph generation. A Hybrid Long Short-Term Memory (LSTM) network was used to encode the hierarchy and context of the HET, and a Relation Ranking Module was developed to dynamically adjust the key relations of the scene graph. This method achieved strong results in scene graph generation and image-specific relations, which are crucial for downstream tasks. Chuhao et al. [20] introduce a novel loop detection method for indoor simultaneous localization and mapping (SLAM). The method uses the geometric information of indoor scenes to generate scene graphs and matches them based on their topology descriptors and volume similarity to find loops in the input. The method is evaluated on a real-world dataset [85] collected in dense indoor scenes, where it outperforms existing semantic-aided loop detection methods.

2.2 Prior knowledge based scene graph generation

Prior knowledge is knowledge gained from one or more contexts, and it can be used to learn more about a problem from various angles. Prior knowledge can therefore be divided into several kinds, such as linguistic priors, visual priors, knowledge priors, and context priors. Scene graph generation involves relationships as combinations of objects with a wider semantic space.
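One simple way to make the notion of a prior concrete is a co-occurrence statistic: count how often each predicate links a given subject class to a given object class in annotated data, and use the normalized counts as a prior over predicates. The sketch below is a generic frequency prior written for illustration; it is not the method of any specific work cited in this section, and the function name and the toy annotations are assumptions.

```python
from collections import Counter, defaultdict


def build_frequency_prior(triplets):
    """Estimate P(predicate | subject class, object class) from annotated triplets."""
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values())
        prior[pair] = {p: c / total for p, c in pred_counts.items()}
    return prior


# Toy annotations (assumed for the example, not drawn from Visual Genome).
training_triplets = [
    ("man", "riding", "horse"),
    ("man", "riding", "horse"),
    ("man", "riding", "horse"),
    ("man", "near", "horse"),
    ("man", "wearing", "hat"),
    ("man", "wearing", "hat"),
]
prior = build_frequency_prior(training_triplets)
print(prior[("man", "horse")])
```

A prior of this kind can be combined with visual evidence, for example by adding its logarithm to the scores of a relation classifier, but it also reproduces the dataset's long-tailed bias toward frequent predicates.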

Li et al. [72] introduce model-agnostic label semantic knowledge distillation (LS-KD) for unbiased scene graph generation, addressing scene graph generation datasets with multiple predicates and missing annotations. It improves performance by incorporating iterative and synchronous self-knowledge distillation strategies. Tianshui et al. [122] propose a system that incorporates statistical correlations into deep neural networks for scene graph generation. This improves performance and addresses the uneven distribution of real-world relationships. The method has been tested on the large-scale Visual Genome dataset and introduces a new evaluation metric, mean recall. Khan et al. [82] developed a method that generates expressive scene graphs by infusing commonsense knowledge for visual understanding and reasoning over images. The method uses a heterogeneous knowledge source and graph embedding to improve performance and expressiveness in visual understanding and reasoning tasks.
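The difference between recall and mean recall can be seen in a small worked example. The sketch below computes Recall@K as the fraction of ground-truth triplets found among the K highest-scoring predictions, and a simplified mean Recall@K that averages recall separately over each predicate class. It is a single-image simplification under assumed inputs: benchmark implementations aggregate over a whole test set and match predicted boxes to ground-truth boxes, which is omitted here.

```python
def top_k_triplets(predictions, k):
    """Return the set of triplets with the k highest scores."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    return {triplet for _, triplet in ranked[:k]}


def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets recovered in the top-k predictions."""
    if not ground_truth:
        return 0.0
    return len(top_k_triplets(predictions, k) & ground_truth) / len(ground_truth)


def mean_recall_at_k(predictions, ground_truth, k):
    """Average of per-predicate recall, so rare predicates weigh as much as frequent ones."""
    top_k = top_k_triplets(predictions, k)
    per_predicate = {}
    for triplet in ground_truth:
        predicate = triplet[1]
        hits, total = per_predicate.get(predicate, (0, 0))
        per_predicate[predicate] = (hits + (triplet in top_k), total + 1)
    if not per_predicate:
        return 0.0
    return sum(h / t for h, t in per_predicate.values()) / len(per_predicate)


# Hypothetical scored predictions and annotations for one image.
ground_truth = {
    ("man", "on", "horse"),
    ("man", "on", "street"),
    ("man", "wearing", "hat"),
    ("horse", "eating", "grass"),
}
predictions = [
    (0.9, ("man", "on", "horse")),
    (0.8, ("man", "on", "street")),
    (0.7, ("man", "wearing", "hat")),
    (0.6, ("man", "near", "horse")),
    (0.2, ("horse", "eating", "grass")),
]
print(recall_at_k(predictions, ground_truth, k=3))
print(mean_recall_at_k(predictions, ground_truth, k=3))
```

In this example three of the four annotated triplets are recovered, so Recall@3 is 0.75, while the missed rare predicate "eating" pulls the per-predicate average down to 2/3. This is why mean recall is more sensitive to the uneven distribution of relationships.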

Jiuxiang et al. [55] present a novel scene graph generation algorithm that incorporates external knowledge and an image reconstruction loss. It aims to improve generalizability in scene graph generation by incorporating commonsense knowledge from an external knowledge base and an auxiliary image reconstruction path. The proposed framework outperforms existing methods on benchmark datasets. The knowledge-based feature refinement module refines object and subgraph features, while the image-level supervision module reconstructs images from detected objects. Jaewon et al. [97] propose a method for visual relationship detection, which involves detecting objects and classifying predicates. The authors address challenges such as intra-class variance, long-tail distribution, and class overlapping. They use language and visual modules, spatial vectors, word vectors, and bounding boxes to improve performance. Their proposed spatial vector effectively detects unseen visual relationships without costly linguistic knowledge distillation or complex loss functions.

Wenbin et al. [133] propose a method for generating a topic scene graph by distilling attention from image captions. The method reduces trivial content and noise, shifting attention from individual objects to relational events. Haiyan Gao et al. [47] proposed the Balanced Award-Punishment Model (BAPM) to address long-tailed dataset biases using energy-based learning and causal reasoning. The model combines a stochastic strategy, knowledge transfer, and a lateral inhibition loss, achieving state-of-the-art performance on both quantitative and qualitative metrics. Jin Chen et al. [53] developed a scene graph generation model for unannotated videos that leverages annotated images. They aimed to infer unseen dynamic relationships [152] and adapt objects and static relationships. To reduce the data distribution disparity between image and video frames, the method combines external commonsense knowledge and hierarchical adversarial learning.

Yuyang et al. [96] present a Dual scene graph Convolutional Network (Dual-SGCN) method for predicting motivation using complex visual and semantic contexts. The method uses multi-task co-training and unbiased motivation inference. Bingqian et al. [17] discuss atom correlation based graph propagation (AG-GP) for scene graph generation to address the long-tailed distribution in datasets. It focuses on diverse atom correlations, node feature initialization, graph propagation, and visual feature refinement, enhancing features and promoting comprehensive knowledge. Runqing et al. [101] present a planning system that enhances the performance of long-term manipulation tasks for autonomous robots, combining visual semantic understanding, regression planning, and multi-level representation using a knowledge base. The method improves success rates and generalization performance in simulation and real-world environments.

Zhanwen et al. [166] introduce a new approach to unbiasing scene graph generation models, called Explicit Ontological Adjustment (EOA), which uses a commonsense knowledge graph and an edge matrix for improved relationship detection. Shixing et al. [105] introduce Cross-modal processing and commonsense reasoning (CMCR), a dense video captioning model that combines visual and audio information. The model improves event localization accuracy and generates more logical captions, outperforming state-of-the-art methods [38]. Gayoung et al. [42] introduce Visual scene graph generation-Net, a deep neural network model that handles irrelevant object pairs, extracts spatial-temporal context, and deals with relationship class imbalance. The model achieves superior results on various datasets [93].

Zhang et al. [75] developed a knowledge-based model by adjusting visual contextual dependency and a Relational proposal network [63], achieving better global and relevant object connections and outperforming other approaches. Zareian et al. [10] developed a graph-based neural network (GB-Net), which iteratively propagates information between and within graphs, achieving high precision. Zareian et al. [11] developed a visual commonsense model for enhanced scene understanding using affordance and intuitive physics data, improving the precision of scene graph generation. Tian et al. [159] developed a road scene graph for intelligent vehicles, using topological graphs for object suggestions and relationships to allow easy processing. Tzu-Jui et al. [127] developed a novel scene graph generation framework that uses supervised and semi-supervised relation learners to reduce the biases of incomplete annotations; it captures minority classes better and is compatible with existing models. Zhenxing et al. [172] developed a novel subgraph-based object context-masked network (SOCNet) for scene graph generation that performs better on challenging datasets.

Aditya et al. [9] developed the Visio-Lingual Message Passing GNN (VL-MPAG Net) to localize objects and relationships in natural images. The approach uses three modules: Proposal Graph Generation (PGG), Structured Graph Learning (SGL), and Joint Proposal Scoring. It outperforms baselines and is compared on four public datasets. Zhang et al. [164] develop Saliency-Guided Message Passing (SMP) for visual relation saliency, enhancing scene graph structure generation and generalizability in applications such as cross-modal text retrieval and image captioning. Xu et al. [86] developed a multi-scale context modeling method for scene graph inference that combines object-centric and region-centric contexts and integrates context-fused inference.

Liu et al. [13] developed a scene-graph-guided message-passing network for dense captioning, achieving comparable results and overcoming difficulties in distinguishing relationships or ambiguous objects. Li et al. [110] developed an attentive gated graph neural network for VRD that propagates messages over the graph. The network treats edges as relationships and nodes as objects, and uses an attention mechanism to measure connectivity. Extensive testing on a widely used dataset demonstrated the method’s efficacy, although it did not produce results comparable to competing methods.

2.3 Deep understanding based scene graph generation

A deep understanding of scene graph generation requires not only the ability to detect and classify objects and predicates but also the ability to capture the context and reasoning behind the scenes in the input data. It is an in-depth exploration of a subject that requires higher-order thinking and critical analysis. The attention mechanism [1] is opening new perspectives in all fields of Artificial Intelligence, as it allows models to focus on the important parts of the input according to the guidance they receive.
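How an attention mechanism focuses on the important parts of the input can be sketched with scaled dot-product attention, one common formulation used in Transformer models. The example below is a minimal pure-Python illustration under assumed toy vectors; it is not the specific attention module of any method surveyed in this section, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Toy features: the query resembles the first object far more than the others.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend(query, keys, values)
print(weights)
print(context)
```

The weights sum to one, and the object whose key best matches the query receives the largest weight, so its value dominates the aggregated context vector. In scene graph generation the queries, keys, and values would typically be learned projections of object or relation features.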

Gao et al. [68] propose a video scene graph generation framework using a transformer-based encoder-decoder structure and a role-aware cross-attention module. They introduced BIG, which uses a video grounding model and extends it to manage multiple instances of predicates with different time slots. The framework achieves superior performance on benchmarks, demonstrating its effectiveness in temporal bipartite graph formulation. Lin et al. [143] developed a Graph Property Sensing Network (GPS-NET) for scene graph generation that explores edge direction, node priority, and long-tailed relationship distribution. It outperforms existing methods on benchmark datasets, effectively captures scene graph properties, and handles the class imbalance problem in scene graph generation. Zhiyuan et al. [169] introduce RSSGG_CS, a model for remote sensing image scene graph generation that fuses contextual information and statistical knowledge to improve feature extraction and relationship prediction. Experiments on the S2SG dataset, the first dataset for RSSGG, show that fusing contextual information suppresses object pairs without semantic relationships, reduces the search space of relationship predicates, makes the model output more consistent with real-world relationships, and facilitates commonsense reasoning.

Lu et al. [155] propose a scene graph generation model that predicts visual relationships in images sequentially and conditionally. The model uses a transformer-based encoder-decoder architecture, reinforcement learning, and sequential conditioning to resolve ambiguities and reduce bias. It achieves strong generalization and robustness. Dong et al. [96] introduce a part-and-sum transformer for visual composite set detection, using composite queries, part-sum interaction, and a factorized self-attention layer. The part-and-sum transformer outperforms custom two-stage methods on VRD and Human-Object Interaction (HOI) detection and generalizes to other two-level hierarchical tasks. Zhecan et al. [173] developed scene graph based Enhanced Image-Text Learning (SGEITL), which uses visual scene graphs in multimodal transformers. The model includes several steps for improving performance on various datasets.

Suprosanna et al. [115] presented Relationformer, a one-stage transformer-based model for image-to-graph generation that is capable of handling tasks such as pathway network extraction, graph extraction, and scene graph generation. It uses $N+1$ tokens, deformable attention, and stochastic relation sampling. The model outperforms existing methods in 2D and 3D and bridges the performance gap with two-stage methods. Siddhesh et al. [107] present a novel scene graph generation framework that uses iterative refinement to jointly reason over objects and relationships in images or videos. The framework is based on a Markov Random Field model and uses a transformer architecture for iterative refinement on datasets. Lite et al. [76] presented an automatic hazard identification method for construction scenes using construction scene graphs. Tested on four working scenes, it achieves 97.82% hazard identification accuracy, enhancing safety monitoring and multimodal information fusion. Xiaogang Xu et al. [147] developed a novel framework for generating images from scene graphs and object crops, using sequential crop selection and progressive scene graph to image generation. The framework, which utilizes the Inception model, outperforms existing methods, and users prefer its images for consistency and visual quality.

Wang et al. [25] developed a novel scene graph generation framework that uses Transformer networks to convert image data into linguistic descriptions of objects and their relationships. The framework improves results on datasets and enhances the understanding of Computer Vision and Natural Language Processing (NLP) tasks. Yuren et al. [150] developed a Relation Transformer (RelTR) model for scene graph generation, which predicts subject-predicate-object triplets using attention mechanisms. The model outperforms state-of-the-art methods on various datasets. Rajat et al. [99] presented a relation Transformer architecture for scene graph generation that captures contextual dependencies and predicts relationships using complex global object interactions and a positional encoding algorithm. Pawit et al. [89] investigate scene graphs for autonomous driving and driver-action prediction using a self-supervision pipeline. The approach incorporates attention mechanisms to create heatmaps and enhance interpretability. The system outperforms fully supervised approaches.

Xingning et al. [137] propose an unbiased scene graph generation method using a Stacked Hybrid-Attention (SHA) Net and a Group Collaborative Learning (GCL) strategy. The SHA Net strengthens the encoder, while GCL optimizes the decoder, mitigating biases and compensating for under-fitting. This approach significantly improves performance on various datasets. Li et al. [168] developed a Dual Attention Messaging Passing (DAMP) model for scene graph generation that addresses imbalanced information transmission. The model integrates internal and external attention mechanisms, regulating information transmission efficiency and improving scene graph generation performance, and achieves comparable results on VRD datasets. Liu et al. [12] presented a region-aware attention learning method for scene graph generation that constructs an attention space to identify the regions of objects and relationship predicates. The method incorporates object-wise and predicate-wise attention GNNs together with intra-triplet and inter-triplet learning mechanisms, and it outperforms other methods on various datasets. Li et al. [90] introduced a novel multi-scale semantic fusion network (MSFN) for remote sensing scene understanding. The MSFN framework includes object detection, a Sparse Relationship Extraction Network, and a Multi-Scale Graph Convolutional Network. It effectively integrates semantic content and predicts potential relationships, outperforming existing methods on the Remote Sensing scene graph dataset. Shanshan et al. [119] developed a novel image captioning model that uses multi-level alignment, semantic knowledge, and spatial relationships to achieve the best results.

Yao et al. [162] propose a novel deep learning architecture called Multihub Driven Attention Network (MHDANet), which improves scene graph generation for digital twin tasks by focusing on valid connections, classifying objects, and predicting relationships, achieving the best performance on various datasets. Zhou et al. [51] developed a deep sparse-based graph attention network for scene graph generation, aiming for effective message passing and contextual learning. The model uses Faster R-CNN, a Relationship Measurement Network, and a Graph Attention Network (GAT) for contextual learning and object classification. Qi et al. [79] propose a novel Attentive Relational Network for scene graph generation, consisting of object detection, semantic transformation, graph self-attention, and relation inference modules. Mi et al. [73] developed a model for Visual Relationship Detection that captures object-level and triplet-level dependencies and outperforms existing methods on VRD datasets. Tian et al. [94] introduce a multi-level semantic task generation network that jointly refines the features of different semantic levels in the input image using message-passing techniques; experiments on two datasets show that the model performs better across different vision task generations.

Hanbit et al. [48] introduce a Multi-Scale Contrastive Learning approach for complex scene generation, improving discriminative ability through locally defined pretext tasks and enhancing local representation at multiple scales. Hua et al. [139] proposed a video captioning model that combines Adversarial Reinforcement Learning and object-subject relational graph. The model extracts motion and attributes information, analyzes object motion, and connects visual content and language. It uses an Adversarial Reinforcement Learning method and multi-discriminator to learn relationships between visual content and words. The method adapts to various application scenarios and achieves satisfactory performance. Tathagat et al. [126] developed a Variational Autoencoder (VARSCENE) model to generate graphs with minimal distribution discrepancy and introduce plausible variations. The encoder embeds the graph into semantic representation vectors, while the decoder generates scene graphs by learning components of nodes and edges. It accurately mimics the underlying distribution of scene graphs in experiments.

Fu et al. [167] proposed a method that addresses the limitations of existing scene graph generation methods when the image lacks sufficient visual context. This methodology transforms textual information into contextualized knowledge, supported by visual items, that enhances the available context. Chu et al. [87] discussed TransMOT, a Multi-Object Tracking (MOT) system for robust target association in tracking modules that uses a spatial-temporal graph transformer. The system models spatial-temporal relations using sparse weighted graphs, estimating association from loosely filtered detections and enhancing MOT in complex scenes. The system is evaluated on multiple versions of MOT datasets, namely MOT15 [71], MOT16 [4], MOT17, and MOT20 [88], with the best results. TransMOT can combine output from a generic image object detector with learnable detectors such as Detection Transformer (DETR) [151] to form a fully end-to-end tracker.

Charulata et al. [23] introduce Multiple Attribute Detector (MAD) modules for capturing structured attribute information in objects. The modules integrate with existing scene graph generation frameworks without altering relation detection and outperform existing methods on various datasets. Wang et al. [62] propose a scene graph-driven Multi-modal Multi-granularity Multi-task learning (M3S) framework for Multi-modal Named Entity Recognition (MNER), aiming to improve visual and textual information utilization. The framework uses a novel multi-task approach, including Named Entity Segmentation and Categorization, and uses scene graphs for modeling objects and relationships. The Multi-granularity Gated Aggregation (MGA) mechanism captures inter-modal interactions and extracts critical features for named entity recognition. Zhecan et al. [172] developed a scene graph Enhanced Image-Text Learning (SGEITL) framework for commonsense reasoning over images, incorporating visual scene graphs and a multi-hop graph transformer. The framework outperforms existing methods on various datasets by leveraging structural knowledge extracted from visual scene graphs.

Jingwen et al. [54] present a Multimodal Graph Inference Network (MGIN) for scene graph generation, using Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN enhances inference capability for triplets, particularly for uncommon samples. The MGIN module incorporates statistical knowledge, while TMFI combines visual and semantic features for efficient prediction. Lyu et al. [36] proposed a weakly supervised visual-textual (vtGraphNet) scene graph for complex visual grounding. The model learns the Bi-modal scene graph, attribute-assigning and relationship-referring models, and a graph consistency loss function. Validated on the dataset, it outperforms state-of-the-art methods in handling simple and complex visual grounding tasks.

2.4 Optimization based scene graph generation

Optimization based scene graph generation is among the most challenging parts of the task, and much recent research in this area has improved the performance of constructing scene graphs from the base input. We collected 29 papers on optimization techniques, and from this list we chose the most recently published research papers for discussion.

Qiu et al. [157] proposed SGTracker, a tracking-based approach that incorporates temporal and spatial contexts. It tracks objects and determines object and predicate labels. SGTracker outperformed existing methods in scene graph localization on the Virtual Home Action Genome (VirtualHAG) dataset, which includes per-frame consistent annotations and relationships requiring both spatial and temporal context. The experiment demonstrated the efficacy of pre-training on the proposed dataset and its potential in real-world scenarios. Swarnendu et al. [104] propose a novel Im2Graph model for scene graph generation that extracts image captions from local regions, builds conceptual graphs from them with a graph generation algorithm, and optimally combines them.

The model is constraint-based [28] and adheres to the principles of Explainable Artificial Intelligence. It can be plugged into any proposition and caption generation system and used instantaneously. Experiments on the dataset show that rich concept graphs can be generated without explicit graph-based supervision. Zhang et al. [65] propose an efficient scene graph generator that considers visual, spatial, and semantic features using a late fusion strategy. It investigates the key factors impacting performance and visualizes learned visual features for relationships. The model’s effectiveness is also examined on an open-image dataset. Lin et al. [144] proposed a subset matching network (SM-Net) that handles the occlusion problem in complex scenes. The network decomposes the scene graph into a node subset and an edge subset and jointly predicts their categories. They evaluated SM-Net on various datasets with reasonable results, showing robustness to occlusion.

Zhiyuan et al. [170] introduce a segmentation-based model for remote sensing image scene graph generation (SRSG), which uses segmentation to produce more accurate scene graphs. The model embeds morphological features and maps them to semantic space, and the work is accompanied by a new dataset called segmentation results to scene graphs (S2SG). The SRSG model outperforms previous methods in generating remote sensing image scene graphs. Zhuoyue et al. [173] develop a semantic description for monocular digestive endoscopy using a scene graph and a clustering algorithm, enhancing feature matching accuracy and robustness. He et al. [123] propose a Decomposition and Composition (DeC) method to avoid biased predictions in scene graph generation by decoupling visual features into intrinsic and relation-dependent components, improving performance on the Visual Genome (VG) and Genome Question Answering (GQA) datasets. Sangmin et al. [117] developed a new framework for scene graph generation that addresses challenges such as ambiguity, asymmetry, and higher-order contexts. It uses local interaction heads, direction-sensitive encoding, and Bi-directional Relationship Classification for better prediction.

Zero-shot learning for scene graph generation is a crucial task that aims to produce a structured representation of the objects and their relations in an image without requiring any training data for the target classes. This technique can benefit various applications such as image captioning [119], visual question answering [130], and image retrieval [58]. Improved losses for novel compositions [15] raise the performance of scene graph generation methods in the zero-shot setting. This can also be achieved by using a knowledge graph completion strategy [148] to generate the missing information of zero-shot triples and then integrating it with the visual features of images. Integrating commonsense knowledge for scene graph generation [140], particularly for zero-shot relation prediction, provides an alternative strategy. This can be accomplished by creating graph mining pipelines and integrating them on top of cutting-edge scene graph generation frameworks to model the neighborhoods and pathways around items in an external commonsense knowledge graph. Motoharu et al. [82] propose open-set scene graph generation for detecting unknown objects and their relationships, extending the applicability of scene graph generation to real-world situations, comparing existing methods, and addressing their limitations.

Li et al. [153] introduce a Lexical Knowledge-aware Memory Network (LKMN) for zero-shot relationship prediction in scene graph parsing. The method reduces intra-class variation by distilling linguistic knowledge of different objects. The LKMN’s effectiveness is evaluated on the dataset, showing effectiveness in zero-shot, few-shot, and supervised settings. Guo et al. [151] developed a model for one-shot scene graph generation tasks using Multiple Structured and Commonsense Knowledge. They used the Instance Relation Transformer encoder to explore the visual entity information. Their experimental results outperformed existing methods for multiple structured knowledge. Hengyue et al. [49] present a Fully Convolutional scene graph generation model that detects and predicts objects and relations simultaneously. Using a bottom-up method, relationships are encoded as 2D vector fields called Relation Affinity Fields (RAF) and objects are encoded as bounding box center points. Extensive experiments show efficacy, efficiency, and generalizability, with competitive recall and zero-shot recall, and reduced inference time.

Zhi et al. [171] propose Union Visual Translation Embedding (UVTransE) for VRD and scene graph generation, an extension of the VTransE method that incorporates contextual information and maps entities and predicates into a low-dimensional semantic space. They used a recurrent-based language model that outperforms previous translation-based models. Nikolaos et al. [83] introduce an adaptive local-context-aware classifier based on object categories, outperforming most approaches. They mine and learn predicate synonyms, apply a distillation-like loss, and evaluate the model on various datasets, showing superior performance.
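The translation-embedding idea behind VTransE-style models can be illustrated with a minimal sketch. It assumes the common formulation in which a relation is plausible when the subject embedding plus the predicate embedding lies close to the object embedding. The vectors, the predicate names, and the helpers `translation_score` and `best_predicate` are hypothetical, and the sketch does not reproduce the union-region features that UVTransE adds.

```python
import math


def translation_score(subj, pred, obj):
    """Negative Euclidean distance between (subj + pred) and obj.

    A score closer to zero means the triplet is more plausible under the
    assumed "subject + predicate ~ object" formulation.
    """
    diff = [s + p - o for s, p, o in zip(subj, pred, obj)]
    return -math.sqrt(sum(d * d for d in diff))


def best_predicate(subj, obj, predicates):
    """Return the predicate name whose embedding best translates subj to obj."""
    return max(
        predicates,
        key=lambda name: translation_score(subj, predicates[name], obj),
    )


# Hypothetical hand-picked 3-d embeddings, for illustration only.
person = [1.0, 0.0, 0.0]
bike = [1.0, 1.0, 0.0]
predicates = {
    "riding": [0.0, 1.0, 0.0],   # person + riding lands exactly on bike
    "under": [0.0, -1.0, 0.0],   # points away from bike
}

print(best_predicate(person, bike, predicates))
```

In a trained model the embeddings would be learned from visual features rather than set by hand; the scoring step is the part this sketch is meant to show.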

3. Bibliographic analysis

In this section, we analyze the scene graph generation topic from several aspects: year of publication, implementation tools, datasets, neural network architectures for scene graph generation, and the evaluation of performance measures.

3.1 Analysis based on dataset

In this part, we analyze the most popularly used datasets for the scene graph generation task. Figure 2 depicts the different datasets used for scene graph generation. According to the analysis, Visual Genome (VG) [100] is the most frequently used and most prominent dataset for scene graphs. From the analysis, we grouped the datasets into five major categories: 2D images, videos, 3D representations, simulated, and custom-built datasets.

Figure 2.

Analysis on dataset.

3.1.1 2D Datasets

Most of the research on VRD and scene graph generation has focused on problems defined over 2D images. Consequently, several 2D image datasets are accessible, and Table 2 summarises their statistics. Some of the most frequently utilized datasets in this investigation are listed below.

Microsoft common objects in context (MS-COCO)

MS-COCO [166] provides images that depict everyday scenes with common objects in their natural context. This advances the development of object segmentation, recognition, and detection tasks. It contains a total of 2.5 million labeled instances in 328k images, and the dataset is unique in that instance-level segmentation masks are annotated, enabling more precise detector evaluation. It also contains rich contextual knowledge about its 80+ object categories. The models in [153, 30, 76, 161, 173, 32, 106, 11, 21, 34] are trained on this dataset and achieved comparable results across different independent tasks.

Visual genome (VG)

The VG [100] dataset gathers detailed annotations of the objects, properties, and relationships inside each image with the aim of enabling the modeling of relationships between the objects in images. It contains more than 100k images with an average of 21 objects, 18 attributes, and 18 pairwise relationships between the objects in each image. The VG dataset represents the densest and largest dataset for image captioning and visual question answering tasks. Many prior works have investigated automatic methods (such as merging and filtering) for cleaning up objects and relations in the image annotations and created their own Visual Genome versions to reduce the noise. In this review, a total of 76 research articles use the VG dataset for training. Some of these subsample VG or mix it with other datasets such as VRD, COCO, AG, GQA, and Open Images; the combinations are listed in Table 1.

Table 1
Combination of VG dataset and other dataset counts

Dataset name	Count
Visual genome (VG)	46
Visual genome (VG) and action genome (AG)	2
Visual genome (VG) and common objects in context (COCO)	5
Visual genome (VG) and genome question answering (GQA)	3
Visual genome (VG) and visual relationship detection (VRD)	14
Visual genome (VG) and visual relationship detection (VRD) and open images	5
Visual genome (VG) and open images	1

Of these, one-shot VG [151], VG-KR and VG150 [164], VG-R10 and VG-A16 [121], VG-DR-NET and VG-MSDN [88, 138], VG-KR [69], VG200 [50], and sVG [14] have been released as cleaned, annotated versions. Other works [139, 27, 92, 76, 118, 173, 87, 48, 62, 9, 131, 129, 125, 52, 105, 39, 21] use problem-specific splits and modifications, which prevents direct comparison with existing experiments.
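As a quick consistency check, the counts in Table 1 can be tallied against the 76 VG-based articles reported above. The snippet below is only a sketch: the dictionary keys are shortened labels chosen here for readability, and the values are copied from Table 1.

```python
# Counts copied from Table 1; keys are abbreviated labels (an assumption
# made for readability, not names used by the reviewed papers).
table1_counts = {
    "VG": 46,
    "VG + AG": 2,
    "VG + COCO": 5,
    "VG + GQA": 3,
    "VG + VRD": 14,
    "VG + VRD + Open Images": 5,
    "VG + Open Images": 1,
}

total = sum(table1_counts.values())
vg_only_share = table1_counts["VG"] / total

print(f"total VG-based articles: {total}")
print(f"share using VG alone: {vg_only_share:.1%}")
```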

Genome question answering (GQA)

GQA [30] is a new dataset that aims to measure visual understanding in computer vision and improve visual question answering. It features compositional questions over real-world images and scene graphs of their objects, attributes, and relations. Each question has a functional program that outlines the phases in the reasoning process as well as a structured description of its semantics. Additionally, the dataset offers fresh measures for evaluating the precision, consistency, reliability, and plausibility of model predictions. The questions are grammatical, diverse, and idiomatic, and are based on natural-language, crowdsourced scene graphs. The dataset is designed to be balanced and clean and to avoid language and world priors. The research articles [23, 130] use this dataset to train their models, which achieved comparable results across different independent tasks.

Visual relationship detection (VRD)

The VRD [22] dataset is among the most promising benchmark datasets for the scene graph generation task. It exposes the problem of the long-tailed distribution of infrequent relationships, and it contains 5k images with 6,672 diverse relationship types. The relations fall broadly into categories such as action, spatial, verbal, preposition, and comparative. In this analysis, the articles [171, 168, 49, 58, 96, 97, 109] used this dataset, and their models achieved the best results.

3.1.2 Video datasets

Action genome (AG)

The AG [56] dataset represents actions as compositions of spatio-temporal scene graphs. AG captures the changes between objects and their pairwise relationships across the input frames. It contains 10,000 videos with 0.4 million objects and 1.7 million relationships, and it provides frame-level scene graph labels for the components of each action. AG is the first benchmark video-based dataset that provides both action and spatio-temporal scene graph labels. According to the analysis, the articles [48, 9, 74] used the Action Genome dataset for their problems and achieved better results under different settings.

Video visual relation (VidVRD)

VidVRD [146] is a dataset that extracts instances of visual relations of interest in a video. A visual relation instance is represented by the relation triplet made up of the subject and object motions. The dataset contains 1k videos with clear visual relations, and it covers common subjects/objects of 35 categories and predicates of 132 categories.

Video object relation (VidOR)

The VidOR [145] dataset contains 10k videos gathered from the YFCC100M collection together with many fine-grained annotations for relation understanding. Most of the objects are annotated with bounding-box trajectories to identify their spatiotemporal placement in the videos. This results in the annotation of about 5K objects and 38K relation instances. According to our investigation, the works [68, 53, 42] used the VidVRD and VidOR datasets for problem-specific evaluation and outperformed other state-of-the-art models.

3.1.3 3D Datasets

3DSGG is a dataset that provides 3D semantic scene graphs for large-scale 3D reconstructions of indoor environments. The semantic graphs built from the scanned environments contain 3D objects as nodes and semantic connections as edges. The dataset was first used in [61] and was generated in a semi-automatic way. The articles [61, 120, 19, 108, 154] used this dataset to train their models and achieved better results than existing methods, including on retrieval tasks.

3.1.4 Simulated datasets

According to our analysis, the authors of [171, 6, 149, 84, 8, 77, 156, 129, 101] created their own datasets for their specific tasks using simulators. Because of limited knowledge and data availability, they simulated datasets to meet their requirements and published them openly for future work. Most simulated datasets in our analysis were created with a Unity-based simulator. Targets and obstacles are placed randomly in the environment so that the model is trained for a wide range of real-world circumstances. Generating such samples costs comparatively little and supports the training of high-performance models. From the analysis, [149, 118, 77, 156, 101] created their own environments for robot action estimation and movement prediction.

3.1.5 Custom-built datasets

According to this analysis, the authors of [139, 27, 93, 76, 118, 170, 87, 48, 131, 95, 129, 125, 52, 105, 40, 21] created their own datasets according to their task-specific requirements. Some datasets were sourced from outside bodies such as NGOs and organizational sites, while others were populated by the authors’ own agents for their task-specific problems. Each article achieved comparatively better results with respect to the needs of its problem.

Table 2
The statistics of common scene graph datasets

	Dataset	Object	Bounding box	Relationship	Triplet	Image	Source link
1	3D semantic scene graph Generation (SSG) [61]	534	–	40	0	48,000	https://3dssg.github.io/
2	Action genome [56]	64000	–	32000	–	10K (82h)	https://github.com/JingweiJ/ActionGenome
3	Genome Question Answering (GQA) [31]	226,000 Questions		1878 Answers		113,018	https://cs.stanford.edu/people/dorarad/gqa/index.html
4	Microsoft Common Objects in Context (MS-COCO) [128]	80	55000	–	–	330,000	https://cocodataset.org/#home
5	Open Images [3]	57	3,290,070	329	374,768	9,173,275	https://storage.googleapis.com/openimages/web/index.html
6	Scene Graph [29]	266	69,010	68	109,535	5k	http://imagenet.stanford.edu/internal/jcjohns/scene_graphs/sgdataset.zip
7	S2 scene graph (SG) [170]	80		12		68,000	https://github.com/leonhwb/s2s_dataset
8	Visual Genome (VG) [126]	1,600	282,460	117	203,375	58,983	http://vrs-vg.com/
9	Video Visual Relation (VidVRD) [146]	35		132		1,000	https://xdshang.github.io/docs/imagenet-vidvrd.html
10	Video Object Relation (VidOR) [145]	80	–	50	–	10,000	https://xdshang.github.io/docs/vidor.html
11	Visual Genome [100]	33,877	3,843,636	40,480	2,347,187	108,077	http://visualgenome.org/
12	Visual Relationship Detection (VRD) [22]	100	–	70	37,993	5,000	https://cs.stanford.edu/people/ranjaykrishna/vrd/

NB. “–” indicates that this attribute is not released.

3.2 Analysis based on year of publication

This survey covers 170+ papers on scene graph generation published over a span of years. Figure 3 shows the distribution of these papers by publication year: 1 article from 2013, 2 articles from 2015 and 2017, 7 and 8 articles from 2018 and 2019, 27 and 28 papers from 2020 and 2021, 47 articles from 2022, and 21 research papers from 2023. In 2022, two further studies [135, 45] on scene graph generation were published, reviewing 191 and 138 papers, respectively.

Figure 3.

The analysis on the publication year.

3.3 Analysis based on implementation tools

This section discusses the software tools used in the reviewed work. The implementation tools utilized for efficient scene graph generation are summarized in Fig. 4. In the analyzed research papers, the implementation tools are PyTorch, Python libraries, CUDA, TensorFlow, and cuDNN. Figure 4 shows that PyTorch is the most frequently used tool for scene graph generation.

Figure 4.

Analysis based on implementation tools.

3.4 Architecture for scene graph generation

This subsection presents the various neural network architectures used for scene graph generation in the reviewed articles. Figure 5 illustrates the different architectures that have been used for scene graph generation. The analysis shows that Graph Convolutional Networks (GCN) and Fast Region Convolutional Neural Networks (FRCNN) are the most popularly used networks for the generation of scene graphs.

Figure 5.

Category analysis.

4. Performance evaluation methods for scene graph generation

In this part, we collect and analyze the standard evaluation techniques for the scene graph generation task. We then present the quantitative findings of each advanced model on the popular VG dataset.

4.1 Methods for evaluation

Scene graph generation requires problem-specific evaluation methods to analyze the performance of the model. The analyses have been categorized into four major methods of this survey. This section describes the common evaluation tasks for scene graph generation as follows:

Predicate classification (Pred cls.) [29, 82]: It determines which pairs interact and classifies the predicate of each pair using a set of localized objects with category names.

Scene Graph Classification (SG cls.) [29]: From a given set of localized objects, it predicts the relationship and object categories of the subject and object in each pairwise connection.

Scene Graph Generation (SGGen.) [29]: It detects a set of objects and predicts the predicate between each pair of detected objects. It resembles phrase detection, except that the bounding boxes of both the subject and the object must at least partially overlap with their respective ground truths. Because SGGen scores only complete triplets, the results cannot accurately express the detection quality of each individual element in the entire scene graph.
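The SGGen matching rule described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `(x1, y1, x2, y2)` box format, the dictionary keys, and the helper names `iou` and `triplet_matches` are assumptions, and the 0.5 overlap threshold is a commonly used default rather than a value stated in this review.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triplet_matches(pred, gt, iou_thresh=0.5):
    """True only if all three labels agree and both boxes overlap enough.

    iou_thresh=0.5 is an assumed default, not a value taken from the text.
    """
    labels_ok = (
        pred["subj_label"] == gt["subj_label"]
        and pred["predicate"] == gt["predicate"]
        and pred["obj_label"] == gt["obj_label"]
    )
    boxes_ok = (
        iou(pred["subj_box"], gt["subj_box"]) >= iou_thresh
        and iou(pred["obj_box"], gt["obj_box"]) >= iou_thresh
    )
    return labels_ok and boxes_ok


# Hypothetical ground-truth triplet and a slightly shifted prediction.
gt = {
    "subj_label": "person", "subj_box": (0, 0, 10, 10),
    "predicate": "riding",
    "obj_label": "bike", "obj_box": (5, 5, 15, 15),
}
pred = dict(gt, subj_box=(0, 0, 10, 9))
wrong_predicate = dict(pred, predicate="on")

print(triplet_matches(pred, gt), triplet_matches(wrong_predicate, gt))
```

Because a single wrong element rejects the whole triplet, this criterion cannot tell a localization error from a predicate error, which is the limitation noted above.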

4.2 Common metrics

Recall@K. [122] It measures the fraction of ground-truth relationships that are recovered by the model’s predictions. The existing metric for evaluating scene graph generation is the image-level recall, which counts the true relationships found among the top K most confident relationship predictions, where K is the number of top-ranked predictions considered. Some studies compute R@K under a constraint that allows only one relationship per object pair, while the unconstrained metric, which admits several predicates per pair, evaluates the model’s reliability.
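A minimal sketch of image-level Recall@K, with and without the one-relationship-per-pair constraint, is given below. It is simplified under stated assumptions: triplets are compared by label only (no box matching), objects are identified by their label rather than by detected instance, and the helper names `recall_at_k` and `apply_graph_constraint` are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions.

    predictions: list of (score, (subject, predicate, object)) tuples.
    ground_truth: set of (subject, predicate, object) tuples.
    """
    if not ground_truth:
        return 0.0
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(hits) / len(ground_truth)


def apply_graph_constraint(predictions):
    """Keep only the highest-scoring predicate for each (subject, object) pair."""
    best = {}
    for score, (subj, pred, obj) in predictions:
        key = (subj, obj)
        if key not in best or score > best[key][0]:
            best[key] = (score, (subj, pred, obj))
    return list(best.values())


# Hypothetical predictions for one image.
ground_truth = {("person", "riding", "bike"), ("bike", "on", "road")}
predictions = [
    (0.9, ("person", "riding", "bike")),
    (0.8, ("person", "on", "bike")),   # second predicate for the same pair
    (0.4, ("bike", "on", "road")),
]

unconstrained = recall_at_k(predictions, ground_truth, k=2)
constrained = recall_at_k(apply_graph_constraint(predictions), ground_truth, k=2)
print(unconstrained, constrained)
```

In this toy case the constraint removes the competing predicate for the person-bike pair, so the second ground-truth triplet fits into the top two. In general the two variants can differ in either direction, which is why the variant used should be reported.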

Precision@K. [153] Precision@K is used in the VRD task to measure tagging accuracy. In the OpenImages VRD Challenge, results are evaluated using Recall@50, the mean AP of relations (mAPrel), and the mean AP of phrases (mAPphr). mAP is a strict metric because it penalizes a prediction when no ground-truth annotation exists for it.

Zero-Shot Recall@K. [142, 97] These research articles proposed zero-shot relationship learning to assess how well a model extends to real-world, long-tailed relationships. A single wRtr@K value, linearly aggregated for all $n \geqslant 0$, can summarize the zero-shot or few-shot performance.
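Zero-shot recall can be sketched by restricting the ground truth to triplets that never occur in the training set. The sketch assumes label-only triplet matching and does not implement the weighted wRtr@K aggregation; the helper name `zero_shot_recall_at_k` and the example triplets are hypothetical.

```python
def zero_shot_recall_at_k(predictions, ground_truth, train_triplets, k):
    """Recall@K computed only over ground-truth triplets unseen in training.

    Returns None when the image has no unseen triplets, so that such images
    can be skipped when averaging.
    """
    unseen = {t for t in ground_truth if t not in train_triplets}
    if not unseen:
        return None
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k if triplet in unseen}
    return len(hits) / len(unseen)


# Hypothetical data: "dog riding skateboard" never appears in training.
train_triplets = {("person", "riding", "bike")}
ground_truth = {("person", "riding", "bike"), ("dog", "riding", "skateboard")}
predictions = [
    (0.9, ("person", "riding", "bike")),
    (0.3, ("dog", "riding", "skateboard")),
]

print(zero_shot_recall_at_k(predictions, ground_truth, train_triplets, k=1))
print(zero_shot_recall_at_k(predictions, ground_truth, train_triplets, k=2))
```

A model that ranks frequent triplets first scores well on ordinary recall but poorly here, which is what makes the zero-shot variant informative for long-tailed relationships.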

4.3 Quantitative performance

We present the quantitative result comparison of the scene graph generation methods with graph constraints on the VG [100] dataset. “–” indicates that there is no evaluation of that particular parameter. Some of the methods employ multiple methodology approaches, problem-specific networks, or priors; these are grouped based on the most prominent methodology approach they employ. Tables 3–6 show the evaluation using the range of recall for prior knowledge based, structure based, deep understanding based, and performance based scene graph generation, respectively.

Table 3
An evaluation using the range of recall for prior knowledge based scene graph generation

Algorithm	Methods	Pred cls (R@100)	SGCL (R@100)	SGGen (R@100)	Year	Author
Iterative self KD and Synchronous Self KD	Agnostic label semantic knowledge distillation, model-agnostic debiasing	41.60	28.80	19	2022	Li et al. [72]
Knowledge Embedded Routing Network	Statistical correlations	19.20	10	7.3	2019	Tianshui et al. [124]
FRCNN, Bi-LSTM, and G refinement G enrichment	Commonsense knowledge graph	–	–	39.12	2022	Khan et al. [84]
FRCNN, Bi-LSTM, FPN	Self-learned knowledge based learning: entity classification and relation classification	26.10	15	11.20	2022	Gao et al. [47]
FRCNN, Bi-LSTM and G refinement, G enrichment	Commonsense knowledge graph	39.20	18.30	–	2023	Zhanwen et al. [166]
Encoder – decoder bidirectional LSTM, RPN, Glove, Roi	Knowledge based model with adjustable visual contextual condition	67.19	39.84	37.63	2022	Zhang et al. [75]
FRCNN, Bi-LSTM, FPN	Self-learned knowledge based learning: entity classification and relation classification	67.51	40.14	37.69	2020	Tzu-Jui et al. [127]
FRCNN, GGNN, Bridging algorithm	Graph Bridging Network (GB-Net)	68.20	38.80	30	2020	Zareian et al. [10]
Encoder, Decoder	Global local attention transformer	19.30	10.40	0	2020	Zareian et al. [11]
LSTM	Concept region message passing mechanism	71.10	27.52	17.43	2021	Liu et al. [13]
GNN, LSTM	Attentive gated graph neural network, visual relationship embedding	68.30	40.10	0	2020	Li et al. [110]

Table 4

An evaluation using the range of recall for structure based scene graph generation

Algorithm	Methods	Pred cls (R@100)	SGCL (R@100)	SGGen (R@100)	Year	Author
Hierarchical recurrent arch, shelf graph to image network	Deep auto-regressive model	MMD node: 0.37	Image Precision: 0.727	Recall: 0.714	2021	Sarthak et al. [103]
MOTIFS, VCTree, SG-Transformer	CogTree loss: SGG net, Bias based CogTree, debiasing CogTree loss	39.7	23.4	21.7	2021	Yu et al. [64]

Table 5

An evaluation using the range of recall for deep understanding based scene graph generation

Algorithm	Methods	Pred cls (R@100)	SGCL (R@100)	SGGen (R@100)	Year	Author
VGG, RPN, node embedding (N2N), E2E	Local context-aware architecture named relation transformer: hierarchical multi-head attention	68.5	43.7		2021	Rajat et al. [99]
Graph Attention Network	Sparse graph attention network and Relationship measurement Net	69.3	41.1	32.9	2022	Zhou et al. [51]
Object-relation GAT, Hyper-relationship GAT	Hyper relationship learning network: Graph attention network	66.9		34.9	2022	Yibing et al. [163]
FRCNN, ROI	Additive attention mechanism and Transfer knowledge	68	38.3	31.4	2022	He et al. [124]
FRCNN, TransE	Attentive relational Network: semantic transformation module, self-attention module, relation inference	61.3	40.4	–	2019	Qi et al. [80]
RPN, FRCNN, ROI, GCN	Multimodal Graph Inference Network: Multimodal Information extraction, Multimodal Feature inference	20.6	10	7.4	2021	Jingwen et al. [54]

Table 6

An evaluation using the range of recall for performance based scene graph generation

Algorithm	Methods	Pred cls (R@100)	SGCL (R@100)	SGGen (R@100)	Year	Author
	Hierarchical context network	68.8	37.30	31.20	2021	Guanghui et al. [44]
FRCNN	Total direct effect, causal inference	29.1	14.90	9.8	2020	Kaihua et al. [68]
GCN	Relation regularized network	26.1	15	11.20	2022	Haiyan et al. [47]
FRCNN, NMS, Graph Interaction head	Local to global interaction net, bidirectional relationship classification	68.7	40.50	31.40	2022	Sangmin et al. [117]
Pixel2Graph, MotifNet, KERN	Conformance recall, violation recall, non-violation recall	53.02	–	–	2020	Jie et al. [59]
FRCNN	compositional diversity of predicates	58.18	35.05	31.14	2022	Xingchen et al. [141]
ROI, Learning scheduler, FRCNN	Cognitive bias	38.44	21.87	17.24	2022	Chan et al. [136]
CNN, Euclidean distance, GRCNN	Self-supervision task-based model	68.10	39.80	31.20	2020	Tzu et al. [127]
GNN, RPN, ORD, GG, BI-RNN	Multimodal features	71.15	54.16	27.61	2020	Gayoung et al. [41]
FRCNN	Hierarchy guided feature learning, hierarchy-guided module	24.01	26.36	14.47	2020	Zhuo et al. [165]
FCN, Feature pyramid network, VGG	Rich and fair semantic extraction network: pseudo-siamese network	88.35	44.38	26.68	2020
RAND, NEIGH, GRAPH	Conditional generative adversarial network	22.40	4.5	–	2021	Boris et al. [16]
FRCNN	Lexical knowledge-aware memory network	18.50	7.11	0.49	2021	Li et al. [153]
IMP	Low frequency first unknown class selection scheme	0	17.93	14.57	2022	Motoharu et al. [81]

5. Applications and future trends of scene graph generation

5.1 Applications of scene graph generation

Generative scene graphs are a way of representing the spatial and semantic relationships between objects in a scene. They can be used for various tasks such as image synthesis, scene understanding, and image captioning. A generative scene graph consists of nodes that represent objects and edges that represent relations. For example, a scene graph for an image of a person riding a bike on a road could have nodes for the person, bike, road, and sky, and edges for riding, on, and above. To generate a scene graph from an image, one can use a neural network that predicts the objects and their attributes, as well as the relations and their types. To generate an image from a scene graph, one can use a Graph Neural Network that takes the scene graph as input and outputs an image that matches the given description. Such generation methods could improve image search engines and produce images for model training according to user requirements. This could shorten searches on the internet and make image retrieval simpler and easier for the systems. Based on our analysis, Fig. 6 shows the generation of different data by a single centralized scene graph generator, and we have grouped the outputs of scene graph generation methods by data type: text, image, and video.
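The person-riding-a-bike example above can be written down as a small data structure. The sketch below is illustrative only: the `SceneGraph` class and its method names are hypothetical, and the choice of which objects the "on" and "above" edges connect is an assumption, since the text names the relations but not their endpoints.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Object name -> list of attribute strings (nodes of the graph).
    objects: dict = field(default_factory=dict)
    # (subject, predicate, object) triplets (directed, labeled edges).
    relations: list = field(default_factory=list)

    def add_object(self, name, attributes=()):
        self.objects[name] = list(attributes)

    def add_relation(self, subject, predicate, obj):
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be added as objects first")
        self.relations.append((subject, predicate, obj))

    def relations_of(self, name):
        """All triplets in which the named object is subject or object."""
        return [r for r in self.relations if name in (r[0], r[2])]


graph = SceneGraph()
for name in ("person", "bike", "road", "sky"):
    graph.add_object(name)

# Edge endpoints are assumed for illustration.
graph.add_relation("person", "riding", "bike")
graph.add_relation("bike", "on", "road")
graph.add_relation("sky", "above", "road")

print(graph.relations_of("bike"))
```

Storing relations as explicit triplets keeps the structure aligned with the subject-predicate-object form that the evaluation tasks in Section 4 score.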

5.1.1 Text generation by scene graph

Text generation from scene graphs produces natural language text from keywords, prompts, queries, or scene graphs. Such systems use a large-scale neural network model trained on a diverse corpus of texts from various domains and languages. They can produce text with distinct characteristics such as context, tone, length, format, and style by adjusting model parameters and hyperparameters. They can also assist users in writing and in improving the quality of their work according to its context, helping users to be more productive, creative, and informative. Many everyday applications already offer this kind of text recommendation or generation based on the user’s keywords. Our survey found the following research articles on text generation.

Figure 6. Example of data generation from scene graphs.

Lawrence et al. [27] developed a method to learn the visual features of semantic phrases from sentences using a Conditional Random Field (CRF) [112, 24, 132] formulation. The model extracts predicate tuples containing nouns and relations from the sentences and uses them to determine the potentials of the CRF. A CRF is a statistical modeling method that combines classification and graphical modeling, leveraging multivariate data and input features for prediction. The method generates scenes based on semantic meaning, making it a significant step forward in computer vision. The CRF is also used as a metric to score a set of scenes for a text-based image retrieval task [116]. Sahand et al. [114] present a scene graph classification framework trained on annotated images and symbolic data, which classifies objects and their relations with the help of a text-to-graph module. The model adjusts the classification pipeline with textual knowledge, producing more precise results in scene graph classification, object classification, and predicate classification. Maximilian et al. [78] introduce a model that describes images in natural language using detected objects and automatically generated visual relationships. The model recognizes individual components and their visual relationships, producing a scene graph from raw image pixels. The system has two parts: a scene graph generation module that generates the scene graph, and a graph-to-text module that produces the final caption from the scene graph using an attention-based LSTM decoder. The authors report that the model outperforms conventional image captioning approaches on a newly generated dataset.
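
For reference, the general form of a CRF mentioned above can be written as follows. This is the standard textbook formulation (see, for example, [24]), not the specific model of [27]; the symbols are assumptions introduced here: x denotes the observed input, y the output variables, C the set of factors (cliques), f_k the feature functions, and θ_k their weights.

```latex
\[
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c, \mathbf{x}),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}'_c, \mathbf{x}),
\qquad
\psi_c(\mathbf{y}_c, \mathbf{x})
  = \exp\Big( \sum_{k} \theta_k \, f_k(\mathbf{y}_c, \mathbf{x}) \Big)
\]
```

Each potential ψ_c scores how compatible a partial assignment y_c is with the input, and the partition function Z(x) normalizes the product of potentials into a probability distribution over all joint assignments.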

5.1.2 Video generation by scene graph generation

Video generation from scene graphs is a recent technique that creates realistic, diverse videos from images and text keywords. Scene structure and context guide the generation, resulting in more coherent, consistent, and natural videos. The technique has potential applications in entertainment, education, and security, for example to support learning and teaching or surveillance and forensics.

Shengyu et al. [102] introduced the Dynamic Scene Graph Detection Transformer (DSG-DETR) for generating dynamic scene graphs from videos. The method uses transformers to capture long-term temporal dependencies between objects and their relationships and to model relationship transitions. Experimental results show that DSG-DETR outperforms existing methods on the Action Genome (AG) dataset. Prateksha et al. [95] proposed Recipe2Video, a model that converts documents into multimodal illustrative videos to improve the user’s consumption experience. The model uses retrieval and re-ranking to select the best images for the recipes and a Viterbi-based optimization algorithm [46] to create videos with visual cues, text, and voice-overs. It captures semantic and sequential information and optimizes for videos with seamless transitions. Sayak et al. [111] introduced TEMPURA, a framework for generating balanced scene graphs from videos that addresses challenges such as long-tailed distributions, noisy annotations, and temporal fluctuations. A Mixture Density Network placed on top of the neural network models the predictive uncertainty caused by noise in the data, and a memory-guided training strategy debiases the predicate embeddings. Chen et al. [70] developed a method for spatio-temporal scene graph generation (STSGG) based on a Slow-Fast Local-Aware Attention network. It addresses issues such as the inability to distinguish between dynamic and static relations and the inaccurate classification of tail predicates. The method achieves state-of-the-art results on the AG and ImageNet Video datasets and improves feature discrimination, with potential applications in computer vision, robotics, and Artificial Intelligence.
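
The Viterbi algorithm [46] referred to above selects the best-scoring sequence of choices by dynamic programming. The following is a minimal sketch of that idea applied to picking one candidate image per step; it is not the Recipe2Video implementation. The function name `viterbi_select` and the two score tables (`relevance` for how well a candidate fits a step, `coherence` for how well two consecutive candidates fit together) are assumptions made for illustration.

```python
def viterbi_select(relevance, coherence):
    """Pick one candidate per step, maximising the total score.

    relevance[t][i]    : score of candidate i at step t (hypothetical input)
    coherence[t][i][j] : score of moving from candidate i at step t
                         to candidate j at step t + 1 (hypothetical input)
    Returns (best_total_score, list of chosen candidate indices).
    """
    if not relevance:
        return 0.0, []

    # best[i] = best score of any path ending in candidate i at the current step
    best = list(relevance[0])
    backpointers = []

    for t in range(1, len(relevance)):
        new_best, pointers = [], []
        for j, rel in enumerate(relevance[t]):
            # Score of reaching candidate j from every candidate of the previous step.
            scores = [best[i] + coherence[t - 1][i][j] for i in range(len(best))]
            i_star = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[i_star] + rel)
            pointers.append(i_star)
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final candidate to recover the whole path.
    last = max(range(len(best)), key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path
```

Unlike a greedy choice of the most relevant candidate at every step, the dynamic program can accept a slightly less relevant candidate when doing so gives smoother transitions across the whole sequence.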

5.1.3 Image generation by scene graph generation

Image generation from scene graphs is a challenging task that aims to create realistic and diverse images from structured representations of objects and their relationships. The main challenge is keeping the generated image consistent with the input: attributes, spatial arrangements, and complex interactions must all be respected. Existing methods fall into two main families, direct and indirect, each with its own advantages and disadvantages. Image generation is useful in computer vision, computer graphics, Natural Language Processing (NLP), and Augmented Reality (AR)/Virtual Reality (VR), where scene graphs provide a high-level understanding of visual content and enable tasks such as image retrieval, scene parsing, Visual Question Answering (VQA), and computer graphics.

Liu et al. [35, 34] present SceneSketcher, a GCN-based architecture for fine-grained scene-level sketch-based image retrieval (SBIR). It generates realistic scenes from simple sketches, adds colors, textures, and lighting effects, and fuses multi-modality information between query sketches and target images. The model is trained end to end in a triplet fashion, and its flexible graph feature learning allows it to generalize to different scene data. SceneSketcher is suited to artists, designers, students, and anyone seeking to unleash their creativity. Rishi et al. [98] introduce a new scene expansion task together with an auto-regressive model for it, the Graph Expansion Model for Scenes (GEMS). GEMS generates hierarchically dependent nodes and edges and uses a cluster-aware Breadth First Search (BFS) method to capture object co-occurrence. Experiments show that GEMS outperforms GraphRNN-based models on graph synthesis, providing creative photographers with recommendations for diverse, rich scenes built around desired seed concepts. Vivek et al. [131] introduce CNN2GNN and CNN2Transformer, two methods for image classification that use inter-example information. They build a bipartite graph in latent space and use GNNs to calculate cross-attention scores between input images and a proxy set. The proxies contrastively learn class-level global information, which is incorporated into the feature representations. CNN2GNN improves image classification performance and allows graphs to be constructed from arbitrary datasets. The approach is useful in applications such as object recognition, scene understanding, and image retrieval. Umair Hassan et al. [81] compare scene graph-based and layout-based image generation models and find that layout-based models generate more realistic, detailed images and better capture spatial relationships and interactions.
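
Triplet training, mentioned above for sketch-based retrieval, pulls a query embedding towards a matching item and pushes it away from a non-matching one. The following is a minimal sketch of a standard hinge-style triplet loss, not the configuration used in [34, 35]; the Euclidean distance, the margin value, and the plain-list feature vectors are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor   : embedding of the query (for example, a sketch)
    positive : embedding of the matching image
    negative : embedding of a non-matching image
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the matching image is closer to the query than the non-matching image by at least the margin, so only triplets that still violate this ordering contribute to training.
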
Image generation from scene graphs and layouts has a wide range of real-time applications. Some of these include:

• Content Creation – These models can be used to generate images for use in advertising, marketing, and other creative industries.
• Virtual and Augmented Reality – Creating a realistic virtual environment based on the user’s perceptions and needs.
• Gaming – These models can be used to generate realistic game environments and characters.
• Education – Creating educational materials such as diagrams and illustrations.

Gaurav et al. [43] developed a method that generates images incrementally from graphs of scene descriptions, preserving context and producing consistent images over time. The approach generates high-quality real-world scenes with multiple objects, with impact on fields such as Robotics, Artificial Intelligence, Design, and Image retrieval. It does not require intermediate supervision and is applicable to real-world images, which makes it useful for a range of applications. Azade et al. [2] introduce a meta-learning approach that adapts a model to different scenes and improves image quality across a diverse variety of tasks. The experimental results demonstrate the model’s performance in terms of image quality and semantic relationships. Chenyang et al. [21] present a CNN that learns image-to-graph translation without external supervision. The self-supervised approach encodes graph nodes and edges, offering benefits for intelligent agents in scene understanding and high-level reasoning. Kim et al. [109] propose a semantic scene graph generation method that uses the Resource Description Framework (RDF) model and deep learning techniques. The approach clarifies the semantic relations between objects in images: meaningful information about the image content is added as nodes and edges of a graph, which enables efficient search and classification algorithms. Justin et al. [57, 58] present a flexible, semantic-based method for retrieving images using scene graphs, improving object localization and accuracy in applications including digital asset management and visual search engines. Aiswarya et al. [121] propose a framework that generates the scene graph of an image using depth and spatial information of the object pairs. The framework predicts the relations and attributes of object pairs directly from images, without using any external knowledge or text descriptions.
The major application benefits of this model are:

• Improved image understanding – scene graphs provide a semantic, structured representation of the objects, relationships, and attributes in an image, and the relations connecting the objects support a deeper understanding of the image.
• Enhanced image retrieval – scene graphs can be used to retrieve images effectively based on constrained learning [28] or guided content, which supports responsible prediction and recommendation so that users get appropriate results.
• Improved image captioning – the deeper understanding and the semantic, structured representation of entities make it more efficient for Artificial Intelligence models to describe an image knowledgeably.

Overall, the applications and use cases of scene graph generation methods have advanced steadily in real-time settings such as task planning and action prediction for robots, topological land scanning for estimating unknown spaces, conceptual reminders for many local tasks, and point cloud estimation. This matters because current semantic or estimation-based prediction models are insufficient for many tasks carried out without human involvement and can predict wrongly. For example, many advanced risk prediction models in economics recently failed because of unforeseen factors such as the coronavirus pandemic, which affected every industry in the world. In the future, scene graph techniques could raise performance and open new possibilities in every field that involves Artificial Intelligence: automation, product development, and biomedical industries could use them to make production more effective and efficient over a sustained period.

5.2 Future directions

The main goal of scene graph generation is to extract the relationships among the entities in the input data and represent them as a graph. Current research covers many aspects, such as optimizing existing methods and applying the techniques to problems in other domains. Nevertheless, several directions deserve attention before the technique can be relied on for trustworthy predictions over long periods. Based on our analysis of the research works and their gaps, scene graph generation has a significant role to play in many domains, such as:

• Deep image understanding tasks, such as image generation from user input, image captioning, image retrieval, visual reasoning, and dynamic generation of image structures from construction sites.
• Dynamic action-based tasks, such as robot navigation prediction, dynamic task planning for modern robots, and dynamic action estimation.
• 3D virtual environment creation for robot training [37] and simulation tasks, such as automatic evaluation of products and task completion by robots.
• Knowledge and drug discovery, where graph formation and relationship prediction store every unit of input gathered from the sources. This makes it possible to evaluate unknown places and predict actions based on the knowledge gained.
• Social relationship detection. Detecting human-object and human-human interactions is crucial for scene graphs, and these relationships can be extended to detect social relationships. This research direction aims to understand scenes more deeply; scene graph generation models can mine unseen social relationships from large-scale datasets, offering practical applications.
• Multi-modal ability. Handling several types of source data to generate common entity graphs can balance the distribution and the challenges of the individual data types. Pei et al. [60] proposed a cross-modality attention framework that matches text with scene inferences from images. This technique will enable organizations to build centralized knowledge.
• Modern methods for scene graph generation. Mainstream scene graph generation methods rely on object classification, detection, and recognition. However, current scene graph datasets and relationship prediction models face limitations.

To improve prediction ability, future scene graph generation methods could incorporate online learning, reinforcement learning, active learning, language model integration, and explainability of predictions. Such strategies will make predictions more trustworthy and responsible when the methods are deployed in the real world.

6. Conclusion

The study of scene graphs is expanding quickly, and there are many potential applications. It aims to enhance the comprehension of, and reasoning about, more complex visual scenes. Current research, however, needs further development and investigation: it is not yet sufficiently accurate and requires more language and context knowledge. This review considered many scene graph generation techniques. More than 170 research papers were gathered and categorized by approach: structure based, prior knowledge based, deep understanding based, and optimization based scene graph generation. A variety of resources were used to compile these papers, and the difficulties encountered by current research are described and evaluated. The survey explains the motivation for studying scene graph generation and helps analysts develop new techniques by addressing the drawbacks identified. It also presents an assessment by year of publication, toolset, architecture for scene graph generation, dataset, and performance evaluation. In this survey, optimization based methods are the most frequently used for improving the quality of scene graph generation. Similarly, PyTorch is the most frequently used tool for implementing scene graph generation, the most commonly used dataset is Visual Genome (VG), and recall is widely used as the performance metric. Future work will focus on resolving the imbalance problem of the training data and on extending the methods to problems in multiple domains.

Authors’ Bios

Monesh S is a Ph.D. candidate in the School of Computer Science Engineering and Information Systems at Vellore Institute of Technology, Vellore. His research focuses on Computer Vision and deep learning, with a particular interest in Graph learning algorithms. He has presented his work at international conferences and his research aims to develop a robust AI-capable system to assist human health.

Senthilkumar N C is a tenured professor in the School of Computer Science Engineering and Information Systems at Vellore Institute of Technology, Vellore. He has over 20 years of experience in academia, and his research areas are Database Management Systems, Web mining, Big Data, and Artificial Intelligence. He has published numerous research papers and book chapters in prestigious scientific journals. ORCID ID: 0009-0000-1876-3602

References

1. Airin, Dawla R.U., Noor A.S., Hasan M.A., Hasan A.R., Zaman and Farid D.M., Attention-Based Scene Graph Generation: A Review, 2022 14th International Conference on Software, Knowledge, Information Management and Applications (SKIMA). IEEE, Phnom Penh, Cambodia (2022).

2. Farshad, Musatian, Dhamo and Navab, Migs: Meta image generation from scene graphs, In Computer Vision and Pattern Recognition (2021).

3. Kuznetsova, Rom, Alldrin, Uijlings, Krasin, Pont-Tuset, Kamali, Popov, Malloci, Kolesnikov, Duerig and Ferrari, The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale, International Journal of Computer Vision 128(7) (2020), 1956–1981.

4. Milan, Leal-Taixe, Reid, Roth and Schindler, MOT16: A benchmark for multi-object tracking, In Computer Vision and Pattern Recognition (2016).

5. Newell and Deng, Pixels to graphs by associative embedding, Advances in Neural Information Processing Systems 30 (2017).

6. Prakash, Debnath, Lafleche J.F. and Camaracci, Sim2SG: Sim-to-Real Scene Graph Generation for Transfer Learning, In Proceedings of the International Conference on Learning Representations (2020).

7. Rosinol, Gupta, Abate, Shi and Carlone, 3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans, In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA (2020).

8. Rosinol, Violette, Abate, Hughes, Chang, Shi, Gupta and Carlone, Kimera: From SLAM to spatial perception with 3D dynamic scene graphs, The International Journal of Robotics Research 40 (2021), 1510–1546.

9. Tripathi, Mishra and Chakraborty, Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

10. Zareian, Karaman and Chang S.F., Bridging knowledge graphs to generate scene graphs, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer International Publishing 12368 (2020), 606–623.

11. Zareian, Wang, You and Chang S.F., Learning visual commonsense for robust scene graph generation, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer International Publishing 12368 (2020), 642–657.

12. Liu A.A., Tian, Nie, Zhang and Kankanhalli, Toward region-aware attention learning for scene graph generation, IEEE Transactions on Neural Networks and Learning Systems 33(12) (2021), 7655–7666.

13. Liu A.A., Wang, Liu and Li, Scene-Graph-Guided message passing network for dense captioning, Pattern Recognition Letters 145 (2021), 187–193.

14. Dai, Zhang and Lin, Detecting visual relationships with deep relational networks, Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA (2017), 3298–3308.

15. Knyazev, Vries H.D., Cangea, Taylor G.W., Courville and Belilovsky, Graph density-aware losses for novel compositions in scene graph generation, arXiv preprint arXiv:2005.08230 (2020).

16. Knyazev, Vries H.D., Cangea, Taylor G.W., Courville and Belilovsky, Generative compositional augmentations for scene graph prediction, Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada (2021).

17. Lin, Zhu and Liang, Atom correlation based graph propagation for scene graph generation, Pattern Recognition 122 (2022), 108300.

18. Wen, Luo, Liu and Huang, Unbiased scene graph generation via rich and fair semantic extraction, In Computer Vision and Pattern Recognition (2020).

19. Agia, Jatavallabhula K.M., Khodeir, Miksik, Vineet, Mukadam, Paull and Shkurti, Taskography: Evaluating robot task planning over large 3D scene graphs, Conference on Robot Learning. PMLR (2022), 46–58.

20. Liu and Shen, Towards View-invariant and Accurate Loop Detection Based on Scene Graph, In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2023).

21. and Dubbelman, Image-Graph-Image Translation via Auto-Encoding, arXiv preprint arXiv:2012.05975 (2020).

22. Krishna, Bernstein and Fei-Fei, Visual relationship detection with language priors, Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing (2016).

23. Patil and Abhyankar, Generating comprehensive scene graphs with integrated multiple attribute detection, Machine Vision and Applications 34(11) (2023).

24. Sutton and McCallum, An introduction to conditional random fields, Foundations and Trends® in Machine Learning 4(4) (2012), 267–373.

25. Szegedy, Ioffe, Vanhoucke and Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, Proceedings of the AAAI conference on artificial intelligence, San Francisco, California, USA 31(1) (2017), 4278–4284.

26. Yan, Chang, Guan, Zhu and Zheng, Zeronas: Differentiable generative adversarial networks search for zero-shot learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 44(12) (2021), 9733–9740.

27. Zitnick C.L., Parikh and Vanderwende, Learning the visual interpretation of sentences, Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia (2013).

28. Liu, Bober and Kittler, Constrained structure learning for scene graph generation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023), 11588–11599.

29. Zhu, Choy C.B. and Fei-Fei, Scene graph generation by iterative message passing, Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA (2017).

30. Hudson D.A. and Manning C.D., Gqa: A new dataset for real-world visual reasoning and compositional question answering, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA (2019).

31. Kim D.J., T.H., Choi and Kweon I.S., Dense relational image captioning via multi-task triple-stream networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11) (2021), 7348–7362.

32. Aksoy E.E., Abramov, Wörgötter and Dellen, Categorizing object-action relations from semantic scene graphs, 2010 IEEE International Conference on Robotics and Automation. IEEE, Anchorage, AK, USA (2010).

33. Chollet, Xception: Deep learning with depthwise separable convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA (2017), 1800–1807.

34. Liu, Zou, Deng, Zuo, Lai, Liu Y.J. and Wang, Scenesketcher: Fine-grained image retrieval with scene sketches, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16. Springer International Publishing 12364 (2020), 718–734.

35. Liu, Deng, Zou, Lai Y.K., et al., SceneSketcher-v2: Fine-grained scene-level sketch-based image retrieval using adaptive GCNs, IEEE Transactions on Image Processing 31 (2022), 3737–3751.

36. Lyu, Feng and Wang, vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding, Neurocomputing 413 (2020), 51–60.

37. Xia, Zamir, Z.Y., Sax, Malik and Savarese, Gibson env: Real-world perception for embodied agents, Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA (2018).

38. Heilbron F.C., Escorcia, Ghanem and Niebles J.C., Activitynet: A large-scale video benchmark for human activity understanding, Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA (2015).

39. Chen, Song, Zeng and Jiang, Scene recognition with prototype-agnostic scene layout, IEEE Transactions on Image Processing 29 (2020), 5877–5888.

40. Huang, Liu, Maaten L.V.D. and Weinberger K.Q., Densely connected convolutional networks, Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA (2017), 2261–2269.

41. Jung and Kim, Multimodal context embedding for scene graph generation, Journal of Information Processing Systems 16(6) (2020), 1250–1260.

42. Jung, Lee and Kim, Tracklet pair proposal and context reasoning for video scene graph generation, Sensors 21(9) (2021), 3164.

43. Mittal, Agrawal, Agarwal, Mehta and Marwah, Interactive image generation using scene graphs, In Computer Vision and Pattern Recognition (2019).

44. Ren, Liao, et al., Scene graph generation with hierarchical context, IEEE Transactions on Neural Networks and Learning Systems 32(2) (2020), 909–915.

45. Zhu, Zhang, Jiang, Dang, Hou, Shen, Feng, Zhao, Miao, Shah S.A.A. and Bennamoun, Scene graph generation: A comprehensive survey, In Computer Vision and Pattern Recognition (2022).

46. Forney G.D., The viterbi algorithm, Proceedings of the IEEE 61(3) (1973), 268–278.

47. Gao, Shi, Jiang, Zhang and Liu, Scene graph generation with award-punishment strategy, Knowledge-Based Systems 251 (2022), 109239.

48. Lee, Kim and Lee S.G., Multi-scale contrastive learning for complex scene generation, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

49. Liu, Yan, Mortazavi and Bhanu, Fully convolutional scene graph generation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).

50. Zhang, Kyaw, Chang S.F. and Chua T.S., Visual translation embedding network for visual relation detection, Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA (2017), 3107–3115.

51. Zhou, Yang, Luo, Zhang and Li, A unified deep sparse graph attention network for scene graph generation, Pattern Recognition 123 (2022), 108367.

52. Armeni, Z.Y., Gwak, Zamir A.R., Fischer, Malik and Savarese, 3d scene graph: A structure for unified semantics, 3d space, and camera, Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Korea (South) (2019).

53. Chen and Wu, Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning, Proceedings of the AAAI Conference on Artificial Intelligence 36(1) (2022).

54. Duan, Min, Lin and Xiong, Multimodal graph inference network for scene graph generation, Applied Intelligence 51 (2021), 8768–8783.

55. Zhao, Lin, Cai and Ling, Scene graph generation with external knowledge and image reconstruction, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA (2019), 1969–1978.

56. Krishna, Fei-Fei and Niebles J.C., Action genome: Actions as compositions of spatio-temporal scene graphs, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA (2020).

57. Johnson, Gupta and Fei-Fei, Image generation from scene graphs, Proceedings of the IEEE conference on computer vision and pattern recognition (2018).

58. Johnson, Krishna, Stark, L.J., Shamma D.A., Bernstein M.S. and Fei-Fei, Image retrieval using scene graphs, Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA (2015).

59. Luo, Zhao, Wen and Zhang, Explaining the semantics capturing capability of scene graph generation models, Pattern Recognition 110 (2021), 107427.

60. Pei, Zhong, Wang and Lakshmanna, Scene graph semantic inference for image and text matching, ACM Transactions on Asian and Low-Resource Language Information Processing 22(5) (2023), 1–23.

61. Wald, Dhamo, Navab and Tombari, Learning 3d semantic scene graphs from 3d indoor reconstructions, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA (2020).

62. Wang, Yang, Liu, Zhu and Liu, M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2022), 111–120.

63. Yang, Lee, Batra and Parikh, Graph r-cnn for scene graph generation, Proceedings of the European Conference on Computer Vision (ECCV) 11205 (2018), 690–706.

64. Chai, Wang and Wu, CogTree: Cognition tree loss for unbiased scene graph generation, In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, Canada (2020).

65. Zhang, Shih, Tao, Catanzaro and Elgammal, An interpretable model for scene graph generation, In Computer Vision and Pattern Recognition (2018).

66. Gao, Chen, Niu, Shao and Xiao, Classification-then-grounding: Reformulating video scene graphs as temporal bipartite graphs, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA (2022).

67.

Zhang

Ren

and Sun

, Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA (2016).

68.

Tang

Niu

Huang

Shi

and Zhang

, Unbiased scene graph generation from biased training, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, WA, USA (2020), 3713–3722.

69.

Lee

K.H.

Palangi

Chen

and Gao

, Learning visual relation priors for image-text matching and image captioning with neural scene graph generators, In Computer Vision and Pattern Recognition (2019).

70.

Chen

Cai

Wang

and He

, Video-based spatio-temporal scene graph generation with efficient self-supervision tasks, Multimedia Tools and Applications 82 (2023), 38947–38966.

71.

Leal-Taixé

Milan

Reid

Roth

and Schindler

, Motchallenge 2015: Towards a benchmark for multi-target tracking, In Computer Vision and Pattern Recognition (2015).

72.

Chen

Shi

Wang

Shao

Yang

and Xiao

, Label semantic knowledge distillation for unbiased scene graph generation, IEEE Transactions on Circuits and Systems for Video Technology 34(1) (2023), 195–206.

73.

and Chen

, Hierarchical graph attention network for visual relationship detection, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, WA, USA (2020).

74.

Tao

Cheng

and Chen

, Predicate correlation learning for scene graph generation, IEEE Transactions on Image Processing 31 (2022), 4173–4185.

75.

Zhang

Yin

Hui

Liu

and Zhang

, Knowledge-Based scene graph Generation with Visual Contextual Dependency, Mathematics 10(14) (2022), 2525.

76.

Zhang

Wang

Sun

and Zhao

, Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge, Automation in Construction 142 (2022), 104535.

77. Luigi L.D., Bolognini, Domeniconi, Gregorio D.D., Poggi and Stefano L.D., Scannerf: a scalable benchmark for neural radiance fields, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

78. Mozes, Schmitt, Golkov, Schütze and Cremers, Scene Graph Generation for Better Image Captioning?, In Computer Vision and Pattern Recognition (2021).

79. Yang, Wang and Luo, Attentive relational networks for mapping images to scene graphs, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA (2019).

80. Sonogashira, Iiyama and Kawanishi, Towards open-set scene graph generation with unknown objects, IEEE Access 10 (2022), 11574–11583.

81. Hassan M.U., Alaliyat and Hameed I.A., Image generation models from scene graphs and layouts: A comparative analysis, Journal of King Saud University-Computer and Information Sciences 35(5) (2023), 101543.

82. Khan M.J., Breslin J.G. and Curry, Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning, European Semantic Web Conference, Cham: Springer International Publishing 13261 (2022), 93–112.

83. Gkanatsios, Pitsikalis and Maragos, From Saturation to Zero-Shot Visual Relationship Detection Using Local Context, In the proceedings of 31st British Machine Vision Virtual Conference, BMVC (2020).

84. Hughes, Chang and Carlone, Hydra: A real-time spatial perception system for 3D scene graph construction and optimization, In Robotics (2022).

85. Silberman, Hoiem, Kohli and Fergus, Indoor segmentation and support inference from rgbd images, Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, Springer Berlin Heidelberg 7576 (2012), 746–760.

86. Liu A.A., Wong, Nie and Kankanhalli, Scene graph inference via multi-scale context modeling, IEEE Transactions on Circuits and Systems for Video Technology 31(3) (2020), 1031–1041.

87. Chu, Wang, You, Ling and Liu, Transmot: Spatial-temporal graph transformer for multiple object tracking, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

88. Dendorfer, Rezatofighi, Milan, Shi, Cremers, Reid, Roth, Schindler and Leal-Taixé, Mot20: A benchmark for multi object tracking in crowded scenes, arXiv preprint arXiv:2003.09003 (2020).

89. Kochakarn, Martini D.D., Omeiza and Kunze, Explainable Action Prediction through Self-Supervision on Scene Graphs, IEEE International Conference on Robotics and Automation (ICRA), ExCeL London (2023).

90. Zhang, Wulamu, Liu and Chen, Semantic relation model and dataset for remote sensing scene understanding, ISPRS International Journal of Geo-Information 10(7) (2021), 488.

91. Ren, Xiao, Chang, Huang P.Y., Gupta B.B., Chen and Wang, A survey of deep active learning, ACM Computing Surveys (CSUR) 54(9) (2021), 1–40.

92. Ren, Xiao, Chang, Huang P.Y., Chen and Wang, A comprehensive survey of neural architecture search: Challenges and solutions, ACM Computing Surveys (CSUR) 54(4) (2021), 1–34.

93. Sun, Cao, Jiang, Zhang, Xie, Yuan, Wang and Luo, Transtrack: Multiple object tracking with transformer, In Computer Vision and Pattern Recognition (2020).

94. Tian and Jiang, Scene graph generation by multi-level semantic tasks, Applied Intelligence 54 (2021), 7781–7793.

95. Udhayanan, Laturia, Chauhan, Khandelwal, Petrangeli and Srinivasan B.V., Recipe2Video: Synthesizing Personalized Videos from Recipe Texts, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

96. Dong, Liao, Zhang, Mahadevan and Soatto, Visual relationship detection using part-and-sum transformers with composite queries, Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada (2021), 3530–3539.

97. Agarwal, Chandra T.S., Patil, Mahapatra, Kulkarni and Vinay, GEMS: Scene Expansion using Generative Models of Graphs, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

98. Koner, Sinhamahapatra and Tresp, Scenes and surroundings: Scene graph generation using relation transformer, arXiv preprint arXiv:2107.05448 (2021).

99. Krishna, Zhu, Groth, Johnson, Hata, Kravitz, Chen, Kalantidis, L.J., Shamma D.A., Bernstein M.S. and Fei-Fei, Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision 123 (2017), 32–73.

100. Miao, Jia and Sun, Long-term robot manipulation task planning with scene graph and semantic knowledge, Robotic Intelligence and Automation 43(1) (2023), 12–22.

101. Feng, Mostafa, Nassar, Majumdar and Tripathi, Exploiting long-term dependencies for generating dynamic scene graphs, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

102. Garg, Dhamo, Farshad, Musatian, Navab and Tombari, Unconditional scene graph generation, Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada (2021).

103. Ghosh, Gonçalves and Das, Im2Graph: A Weakly Supervised Approach for Generating Holistic Scene Graphs from Regional Dependencies, Future Internet 15(2) (2023), 70.

104. Han, Liu, Zhang, Gong, Zhang and He, Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph, Complex and Intelligent Systems (2023), 1–18.

105. Inuganti and Vineeth N.B., Assisting scene graph generation with self-supervision, arXiv preprint arXiv:2008.03555 (2020).

106. Khandelwal and Sigal, Iterative scene graph generation, Advances in Neural Information Processing Systems 35 (2022), 24295–24308.

107. Kim, Lee and Yoo H.J., A low-power graph convolutional network processor with sparse grouping for 3d point cloud semantic segmentation in mobile devices, IEEE Transactions on Circuits and Systems I: Regular Papers 69(4) (2022), 1507–1518.

108. Kim, Jeon T.H., Rhiu, Ahn and Im D.H., Semantic scene graph generation using RDF model and deep learning, Applied Sciences 11(2) (2021), 826.

109. Tang, Zhang and Jiang, Attentive gated graph neural network for image scene graph generation, Symmetry 12(4) (2020), 511.

110. Nag, Min, Tripathi and Roy-Chowdhury A.K., Unbiased Scene Graph Generation in Videos, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada (2023), 22803–22813.

111. Schuster, Krishna, Chang, Fei-Fei and Manning C.D., Generating semantically precise scene graphs from textual descriptions for improved image retrieval, Proceedings of the fourth workshop on vision and language (2015), 70–80.

112. Sharifzadeh, Baharlou S.M., Schmitt, Schütze and Tresp, Improving scene graph classification by exploiting knowledge from texts, Proceedings of the AAAI Conference on Artificial Intelligence 36(2) (2022).

113. Shit, Koner, Wittmaann, Paetzold, Ezhov, Pan, Sharifzadeh, Kaissis, Tresp and Menze, Relationformer: A unified framework for image-to-graph generation, European Conference on Computer Vision. Cham: Springer Nature Switzerland (2022), 422–439.

114. Unar, Wang, Zang and Wang, Detected text-based image retrieval approach for textual images, IET Image Processing 13(3) (2019), 515–521.

115. Woo, Noh and Kim, Tackling the challenges in scene graph generation with local-to-global interactions, IEEE Transactions on Neural Networks and Learning Systems 34(12) (2022), 9713–9726.

116. Zhou and Liu, Hierarchical Context-Based Emotion Recognition With Scene Graphs, IEEE Transactions on Neural Networks and Learning Systems 35(3) (2022), 3725–3739.

117. Zhao and Peng, Aligned visual semantic scene graph for image captioning, Displays 74 (2022), 102210.

118. S.C., Wald, Tateno, Navab and Tombari, Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA (2021).

119. Aiswarya S.K. and Jyothisha J.N., Scene Graph Generation Using Depth, Spatial, and Visual Cues in 2D Images, IEEE Access 10 (2021), 1968–1978.

120. Chen and Lin, Knowledge-embedded routing network for scene graph generation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA (2019), 6156–6164.

121. Gao, Song and Li Y.F., State-Aware Compositional Learning Toward Unbiased Training for Scene Graph Generation, IEEE Transactions on Image Processing 32 (2022), 43–56.

122. Gao, Song, Cai and Li Y.F., Learning from the scene and borrowing from the rich: Tackling the long tail in scene graph generation, In the proceeding of Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (2020).

123. Ohta, Tanaka and Yamamoto, Scene graph descriptors for visual place classification from noisy scene data, ICT Express 9(6) (2023), 995–1000.

124. Verma, Agrawal, Vinay and Chakrabarti, Varscene: A deep generative model for realistic scene graph synthesis, International Conference on Machine Learning. PMLR 162 (2022).

125. Wang T.J.J., Pehlivan and Laaksonen, Tackling the unannotated: Scene graph generation with bias-reduced models, In Computer Vision and Pattern Recognition (2020).

126. Lin T.Y., Maire, Belongie, Bourdev, Girshick, Hays, Perona, Ramanan, Zitnick C.L. and Dollár, Microsoft coco: Common objects in context, Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer International Publishing (2014).

127. Kim U.H., Park J.M., Song T.J. and Kim J.H., 3-D scene graph: A sparse and semantic representation of physical environments for intelligent agents, IEEE Transactions on Cybernetics 50(12) (2019), 4921–4933.

128. Damodaran, Sharanya, Kumar, Anjana, Mitamura, Nakashima, Garcia and Chu, Understanding the role of scene graphs in visual question answering, Proceedings of the 16th International Symposium on Visual Information Communication and Interaction (2021), 1–8.

129. Trivedy and Latecki L.J., CNN2Graph: Building Graphs for Image Classification, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023), 1–11.

130. Cong, Wang and Lee W.C., Scene graph generation via conditional random fields, arXiv preprint arXiv:1811.08075 (2018).

131. Wang and Chen, Topic scene graph generation by attention distillation from caption, Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada (2021).

132. Wang, Shan and Chen, Sketching image gist: Human-mimetic hierarchical scene graph generation, European conference on computer vision, Cham: Springer International Publishing 12358 (2020), 222–239.

133. Chang, Ren, Chen and Hauptmann, A comprehensive survey of scene graphs: Generation and application, IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1) (2021), 1–26.

134. Chang, Wang, Sun and Cai, Biasing Like Human: A Cognitive Bias Framework for Scene Graph Generation, In Computer Vision and Pattern Recognition (2022).

135. Dong, Gan, Song, Cheng and Nie, Stacked hybrid-attention and group collaborative learning for unbiased scene graph generation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022).

136. Han, Yang, Zhang, Gao and Zhang, Image scene graph generation (sgg) benchmark, In Computer Vision and Pattern Recognition (2021).

137. Hua, Wang, Rui, Shao and Wang, Adversarial reinforcement learning with object-scene relational graph for video captioning, IEEE Transactions on Image Processing 31 (2022), 2004–2016.

138. Kan, Cui and Yang, Zero-shot scene graph relation prediction through commonsense knowledge integration, Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part II 21. Springer International Publishing (2021), 466–482.

139. Chen, Shao, Xiao, Zhang and Xiao, Rethinking the evaluation of unbiased scene graph generation, In Computer Vision and Pattern Recognition (2022).

140. Liang, Lee and Xing E.P., Deep variation-structured reinforcement learning for visual relationship and attribute detection, Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA (2017), 4408–4417.

141. Lin, Ding, Zeng and Tao, Gps-net: Graph property sensing network for scene graph generation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA (2020).

142. Lin, Zeng and Li, Divide and Conquer: Subset Matching for Scene Graph Generation in Complex Scenes, IEEE Access 10 (2022), 39069–39079.

143. Shang, Xiao, Cao, Yang and Chua T.S., Annotating objects and relations in user-generated videos, Proceedings of the 2019 on International Conference on Multimedia Retrieval (2019), 279–287.

144. Shang, Ren, Guo, Zhang and Chua T.S., Video visual relation detection, Proceedings of the 25th ACM international conference on Multimedia (2017), 1300–1308.

145. and Xu, Hierarchical image generation via transformer-based sequential patch selection, Proceedings of the AAAI Conference on Artificial Intelligence 36(3) (2022).

146. Chen, Sun, Yuan and Wu, Zero-Shot Scene Graph Generation with Knowledge Graph Completion, 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, Taipei, Taiwan (2022).

147. Chang, Ballotta and Carlone, D-lite: Navigation-oriented compression of 3d scene graphs under communication constraints, IEEE Robotics and Automation Letters 99 (2022), 1–8.

148. Cong, Yang M.Y. and Rosenhahn, Reltr: Relation transformer for scene graph generation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45(9) (2023), 11169–11183.

149. Guo, Song, Gao and Shen H.T., One-shot scene graph generation, Proceedings of the 28th ACM International Conference on Multimedia (2020), 3090–3098.

150. Guo, Gao, Song, Wang, Sebe and Shen H.T., Relation regularized scene graph generation, IEEE Transactions on Cybernetics 52(7) (2021), 5961–5972.

151. Yang, Huang and Xu, Zero-shot predicate prediction for scene graph parsing, IEEE Transactions on Multimedia 25 (2022), 3140–3153.

152. Liu, Long, Zhang, Liu, Zhang, Yin and Yang, Explore contextual information for 3d scene graph generation, IEEE Transactions on Visualization and Computer Graphics 29(12) (2022), 5556–5568.

153. Rai, Chang, Knyazev, Shekhar, Taylor G.W. and Volkovs, Context-aware scene graph generation with seq2seq transformers, Proceedings of the IEEE/CVF international conference on computer vision, Montreal, QC, Canada (2021).

154. Qiu, Yamamoto, Yamada, Suzuki, Kataoka, Iwata and Satoh, 3D Change Localization and Captioning from Dynamic Scans of Indoor Scenes, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

155. Qiu, Nagasaki, Hara, Kataoka, Suzuki, Iwata and Satoh, VirtualHome Action Genome: A Simulated Spatio-Temporal Scene Graph Dataset with Consistent Relationship Labels, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

156. Ren, Choi C.L. and Leutenegger, Visual-inertial multi-instance dynamic SLAM with object-level relocalisation, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (2022).

157. Tian, Carballo and Takeda, Road scene graph: A semantic graph-based scene representation dataset for intelligent vehicles, arXiv preprint arXiv:2011.13588 (2020).

158. Wang, Gao, Guo, Wan, Yang and Huang, Transformer networks with adaptive inference for scene graph generation, Applied Intelligence 53(8) (2023), 9621–9633.

159. Wanyan, Yang and Xu, Dual scene graph convolutional network for motivation prediction, ACM Transactions on Multimedia Computing, Communications and Applications 19(3) (2023), 1–23.

160. Yang, Alazab, Kumar and Han, Integrating Multihub Driven Attention Mechanism and Big Data Analytics for Virtual Representation of Visual Scenes, IEEE Transactions on Industrial Informatics 18(2) (2021), 1435–1444.

161. Zhan, Chen, Tao and Luo, Hyper-relationship learning network for scene graph generation, arXiv preprint arXiv:2202.07271 (2022).

162. Zhang, Pan, Yao, Huang, Mei and Chen C.W., Boosting scene graph generation with visual relation saliency, ACM Transactions on Multimedia Computing, Communications and Applications 19(1) (2023), 1–17.

163. Zhou, Sun, Zhang and Ouyang, Exploring the hierarchy in relation labels for scene graph generation, arXiv preprint arXiv:2009.05834 (2020).

164. Chen, Rezayi and Li, More Knowledge, Less Bias: Unbiasing Scene Graph Generation with Explicit Ontological Adjustment, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA (2023).

165. Feng, Zheng and Cai, Knowledge-Enhanced Scene Graph Generation with Multimodal Relation Alignment (Student Abstract), Proceedings of the AAAI Conference on Artificial Intelligence 36(11) (2022).

166. Feng and Ruan, Dual Attention Message Passing Model for Scene Graph Generation, 2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS). IEEE, Dali, China (2019).

167. Lin, Zhu, Wang, Kong, Wang, Huang and Hao, RSSGG_CS: Remote sensing image scene graph generation by fusing contextual information and statistical knowledge, Remote Sensing 14(13) (2022), 3118.

168. Lin, Zhu, Kong, Wang and Wang, SRSG and S2SG: a model and a dataset for scene graph generation of remote sensing images from segmentation results, IEEE Transactions on Geoscience and Remote Sensing 60 (2022), 1–11.

169. Ravichandran, Peng, Hughes, Griffith J.D. and Carlone, Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks, 2022 International Conference on Robotics and Automation (ICRA). IEEE, Philadelphia, PA, USA (2022).

170. Wang, You, L.H., Zareian, Park, Liang, Chang K.W. and Chang S.F., SGEITL: Scene graph enhanced image-text learning for visual commonsense reasoning, Proceedings of the AAAI Conference on Artificial Intelligence 36(5) (2022).

171. Yang, Pan and Qin, Scene-graph-driven semantic feature matching for monocular digestive endoscopy, Computers in Biology and Medicine 146 (2022), 105616.

172. Zheng and Feng, Subgraph and object context-masked network for scene graph generation, IET Computer Vision 14(7) (2020), 546–553.

173. Hung Z.S., Mallya and Lazebnik, Contextual translation embedding for visual relationship detection and scene graph generation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43(11) (2020), 3820–3832.
