Abstract
Video grounding performs temporal localization in multimedia information retrieval: the temporal bounds of the target video span are determined for a given input query. A novel interactive multi-head self-attention (IMSA) transformer is proposed to localize an unseen moment in an untrimmed video for a given image query. A new semantic-trained self-supervised approach is used to perform cross-domain learning that matches the image query with video segments. It normalizes the convolution function, enabling efficient correlation and collection of semantically related video segments across time based on the image query. A double hostile contrastive learning method with Gaussian distribution parameters is advanced to learn the video representations. The proposed approach operates dynamically on various video components to achieve exact semantic synchronization and localization between queries and video. Experiments on benchmark datasets show that the proposed IMSA model localizes frames more accurately than other approaches and significantly increases temporal grounding accuracy. The moment is identified in the video with a start and end boundary, achieving an average recall of 86.45% and a mAP of 59.3%.
Keywords
Introduction
The fast growth of video data necessitates the development of effective indexing and retrieval services. Video grounding can be applied in many applications such as video surveillance [5], video retrieval systems, sports video analysis, and automated driver systems. Many researchers focus on the video domain for surveillance systems. Analyzing compressed video content typically begins by dividing the video into multiple related frames known as groups of pictures (GOP), from which one or more keyframes or indicative frames can be recovered for each GOP. Broadcasters broadcast the highlights of a sports event at different time intervals. These highlight events are detected and clustered [20] based on the representative clips with respect to skimming time. For retrieval, the similarity between the query and the video clips is measured. A distance metric such as Hamming distance, Euclidean distance, or zero-normalized correlation coefficient (ZNCC) is computed to measure the similarity among the frames. Contrastive loss, triplet loss, and quadruplet loss are used to train the similarity learning model. Triplet loss ensures that the distance between an anchor and a positive sample is smaller than the distance between the anchor and a negative sample by a set margin. It also brings samples of the same class closer to their class centroids in the embedding space, improving the representation of similarities and differences. A remaining difficulty is the lack of temporal consistency among successive frames of the video.
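The margin condition described above can be illustrated with a minimal NumPy sketch. It assumes Euclidean distance and a hinge formulation; the function name and the default margin are illustrative choices, not values taken from this work.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances, averaged over the batch.

    The loss is zero for a triplet once the anchor-negative distance exceeds
    the anchor-positive distance by at least `margin` (an assumed value).
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor to positive
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor to negative
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

A triplet whose negative is already far from the anchor contributes nothing, so training effort concentrates on triplets that still violate the margin.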
The complexity of modern deep learning algorithms has increased greatly, allowing them to extract characteristics and make significant connections; however, this comes at the expense of increased time and resource requirements. Unlike algorithms for image classification or object recognition, which address a single task, a tracking method is a multi-purpose algorithm: it finds objects (object detection), puts them in categories (classification), and keeps track of them across frames (tracking). Such a method is difficult to train and requires considerable effort. Additionally, the speed of a tracking algorithm at inference time is a critical factor in the reliability of the data it provides. For real-time object tracking models, boosting tracking performance is therefore crucial.
Many deep learning techniques, such as convolutional neural networks (CNN) and feed-forward neural networks (FFN), are used to extract features from a given image or video. When such networks learn the features of image or video data without labels, they are known as self-supervised learning models. Video localization involves two major problems: selecting the essential representative frames and selecting the frames with maximum semantic value. This work concentrates on video moment localization to address these key issues. Existing pooling methods have a high computational cost when retrieving a video and localizing a particular frame from a large-scale database. Temporal localization of frames based on an image query with high accuracy is a challenging task. Transformer network models convert the extracted features into context vector representations using encoders and decoders. Recurrent neural networks (RNN) such as long short-term memory (LSTM) are preferred for processing and predicting sequential data. The attention mechanism (Guo et al., 2021) is introduced in the transformer to implement a cross-learning model from the outputs of deep neural network (DNN) models. Attention is applied in the encoder-decoder layers [16, 32] to concentrate more on visual features and their embeddings.
Contrastive learning (CL) is another approach to self-supervised learning that is emerging as an efficient method for training deep neural networks. Unlike generative models, CL is a discriminative method that tries to group similar samples together and push dissimilar ones apart. Figure 1 shows the self-supervised learning pipeline used to perform contrastive learning. To train itself, the model first augments the input data: one instance from the training sample is chosen and a changed variant of it is obtained using suitable data augmentation techniques. The augmented data is then passed to the encoder, and from the encoded vectors a pair of images (positive or negative) is chosen. A similarity metric is used to compare the two embeddings. A contrastive loss is estimated from the feature representations of the images produced by the encoder network, especially for computer vision tasks.

Figure 1. Self-supervised training pipeline.
Video retrieval tasks are performed for a given text query or video clip. Many works use holistic approaches such as Euclidean distance [2], similarity estimation, and ranking scores. These scores are used to map the embedding spaces of video and text. Retrieving moments from a video corpus is the extension of single-moment retrieval in one video. LSTM is used to transfer the spatial information of the video into a flat representation, but spatial information is lost in the process [15, 33]. Hence, a spatial encoder or attention mechanism is needed to capture the spatial relationship between video and query. A spatial attention mechanism should learn which frame regions to attend to for each word in the query.
In the proposed research work, for a given image query, the semantically related occurrence of the query image is detected in the given video, and a moment sequence-sensitive localizer is framed to localize the contender video moment for retrieval. A multi-head self-attention transformer model is developed to compute the attention score of visual words in the video. Self-supervised contrastive learning is used to localize the video moment for the given input query by computing the similarity as well as the contrastive loss.
The major contributions are:
(1) An image-based moment localization is proposed, which uses an image query to locate unseen behaviours in full-length untrimmed videos. (2) A moment localization framework is proposed that uses an interactive multi-head self-attention (IMSA) mechanism with contrastive learning to model the temporal progression of events based on the input query. (3) The active spatiotemporal features of the video are captured and transformed to learn rich relational features. (4) The IMSA network outperforms convolutional networks.
Many video grounding works are rank-based or attention-based methods that localize the target video extent by computing the best matching score. Video grounding performs action recognition or concept localization. It identifies all the action classes that occur in the video sequence. Action recognition is more specific and falls under the multi-class classification problem. An action may occur from frame A to frame B, and the aim is to determine the boundaries of that action occurrence. Reinforcement learning is also deployed to filter the video moment with temporal bounds.
Mallick et al. [18] detected the salient region and extracted the salient features of the extracted keyframes. Matching is performed between the query video features and the salient features of the extracted keyframes, which are stored in repositories. Araujo and Girod [1] proposed an asymmetric approach to video retrieval using image queries. The video is segmented, and Fisher vectors and embeddings are constructed with hashing based on Bloom filters. The image descriptors and video descriptors are then compared to retrieve similar frames. Videos are represented using Fisher vectors and Bloom filters and are temporally aggregated for retrieval. The optimization of Fisher vectors needs improvement when performing an asymmetric comparison. Peng and Ngo [20] implemented clip-based similarity measurement for video retrieval. Irrelevant frames are filtered out, and similarity matching (based on shared visual information) is done between the query clip and the video. By maximizing the similarity, the frames are clustered and retrieved. The computational complexity depends on the number of shots and edges that match. Rahmani and Zargari [21] introduced a prediction unit size feature vector on B and P frames. Retrieval is performed using motion vectors and colour histograms, and the retrieved videos are ranked. Lin et al. (2020) extracted video features using a deep CNN and retrieved videos by preserving the similarity of the spatiotemporal relationship in pooled frames. The Adam optimizer was used to update the parameters. The rank-based structured loss function is computed with L2 normalization and Hamming distance, and mean average precision is used as the evaluation metric.
Sentence query-based video retrieval
Shen et al. [22] performed temporal hashing with 10 layers of gated recurrent units to retrieve video, using a Hamming ranking criterion for search. Yu et al. [27] implemented a multilinear pooling method to perform video grounding for a given sentence query. The text and video features are integrated with multitask learning, and the losses are regularized. Wang et al. [24] developed an interactive transformer to model the interactions of the text query and video pair and to predict the temporal boundaries in the video. Yang et al. [25] developed a local correspondence network that localizes video moments by representing features hierarchically and modelling the target moments through the correspondence of local features. This models GPU memory usage. Gao et al. (2020) performed video knowledge transfer between graph-based neural networks and the embedding space to retrieve the moments in a video for a given sentence. Zhang et al. [30] provided an excellent survey on temporal grounding tasks in video.
Action recognition
Gao et al. [7] proposed a cross-modal model for temporal activity localization in an untrimmed video based on the query. Temporal regression concentrates on computing the alignment scores and clip regressions. Jaiswal et al. [12] provided an extensive survey on contrastive learning for self-supervised learning models. Zhang et al. [29] implemented contrastive learning along with a unimodal encoding method to retrieve the queried moment from the video. Interactions between text and visual features are modelled, and fine-grained video retrieval is performed. Chen et al. [4] implemented a multimodal network for temporal grounding. Table 1 shows previous works on contrastive learning with their pros and cons.
Table 1. Contrastive learning.
Video understanding involves variants such as convolution, self-attention, and motion learning. Convolution and attention aim to understand the video by capturing its temporal dynamics across several frames [6]. Spatiotemporal convolution is a 3D convolutional model combined with self-attention for better video understanding and semantic-based video retrieval. Although vision transformers are designed for this purpose, they fall short in relating the inter- and intra-frame features of video (spatial and temporal) to the context vector. A video with input features (height, width, time) is transformed into a context vector using neighbouring feature information. The transform function is applied at each position of the features X with a learnable weight W, mapping them to a context vector Y. Convolution translates the features into an equivariant form by applying kernels. The region of interest and its features are extracted using a convolutional neural network with many kernels, but the convolution kernels remain static. This does not affect the target visual, but because of channel weight dependency, redundancy occurs and complexity increases.
Self-attention is a dynamic transform mechanism, consisting of an encoder and decoders, that generates a context vector attention map using a target. It then accumulates the context using a dynamic kernel attention map. The attention mechanism has three computations, namely query, key, and value, for the given input and target. Using the learnable embeddings of the Q, K, and V matrices, the attention embeds the target into a query and projects the context into key and value matrices. The attention map is computed as a scaled dot product of the query and key matrices. Here the content attention is computed together with the positional embedding information. The self-attention [23, 14] transformer aggregates the context embedding values using the attention map as kernel weights, so that an interaction takes place between the context and the position. A SoftMax activation function is applied to the aggregated context vectors based on the target. It is flexible and uses only a few parameters. SoftMax reduces the computational complexity, and hence self-attention is used for video understanding, classification, and retrieval. Existing transforms fall short in providing context-to-context interactions and content-to-position interactions. Thus, video understanding and localizing a moment in a video are difficult tasks, as video consists of both spatial and temporal content.
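The query, key, and value computation described above can be sketched for a single head as follows. This is the standard scaled dot-product formulation, in which SoftMax is applied to the scaled query-key scores before the values are aggregated; the matrix names and sizes are illustrative assumptions rather than the configuration of the proposed model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(target, context, w_q, w_k, w_v):
    """Single-head attention: the target is embedded as queries, the context as keys and values."""
    q = target @ w_q                           # (n_target, d_k)
    k = context @ w_k                          # (n_context, d_k)
    v = context @ w_v                          # (n_context, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # scaled dot product
    attn = softmax(scores, axis=-1)            # attention map, each row sums to 1
    return attn @ v, attn                      # aggregated context, attention map

# Hypothetical sizes: 3 query regions, 5 video clips, 16-d features, 8-d heads.
rng = np.random.default_rng(0)
target = rng.standard_normal((3, 16))
context = rng.standard_normal((5, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 8)) for _ in range(3))
output, attention_map = scaled_dot_product_attention(target, context, w_q, w_k, w_v)
```

A multi-head variant would run several such heads with separate weight matrices and concatenate their outputs.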
Proposed methodology
An overview of the proposed spatio-temporal multi-head attention moment localization model is illustrated in Figure 2. Initially, the model encodes the image and the video separately with an attention mechanism. A set of untrimmed videos of any length, each consisting of a sequence of frames with optional subtitles, is given as input to the proposed model. Each video is correlated with the image query that corresponds to a video segment containing semantically similar image information. For a given image query Q, the main task is to retrieve the temporal moment, bounded by the starting and ending positions of the video frames. A unimodal approach is proposed to perform efficient video moment localization and retrieval with high accuracy. Unimodal techniques such as temporal segment networks and temporal action proposals analyze video data only, making them simpler but potentially less effective than multimodal approaches, which incorporate text, image, and video data for richer context. The position and input embeddings are passed to the IMSA model explained in the following section. During this process, an interactive multi-head self-attention (IMSA) transformer model is used to determine the attention score and the alignment score among the encoded features. The IMSA transformer is designed for detailed retrieval by analyzing the video content and the query image regions. It employs multiple attention heads to pinpoint relevant information in both the video and the query image, facilitating accurate matching and retrieval of pertinent content. The features are then normalized. In addition, contrastive learning (CL) with contrastive loss estimation is computed at the moment sequence-sensitive localizer, which provides better accuracy. Further, to enhance performance, Gaussian Mixture Model (GMM) parameters are trained for localization among the representations in latent space.
Parameter optimization in Gaussian Mixture Models (GMMs) for latent space representations aims to refine parameters such as means and covariances iteratively. This enhances the model’s ability to accurately capture data distribution nuances, improving localization of data points within the latent space.

Figure 2. Interactive Multi-head Self Attention (IMSA) for moment localization.
Each video V consists of many clips of fixed length, denoted {Ci}. Clip-level representations are used to learn semantic alignments, which may include annotations for the candidate moments within the joint embedding space. Representing each clip individually reduces computational complexity and allows focused analysis. Semantic alignments align clips with concepts, aiding accurate recognition and interpretation and improving task effectiveness in surveillance, action recognition, and content understanding. Each query image consists of many regions of interest (ROI). In the first stage of the model, a faster CNN is employed to extract the features. In the second stage, the localization of the ROI is fine-tuned. C3D, a 3-dimensional convolutional neural network, is used to extract the high-level features of the video. Similarly, the features of the given query image are extracted using a faster convolutional neural network.
Based on the regions with high precision, self-attention is performed on the video to semantically detect and localize the particular temporal moment for a certain window length. 4096-dimensional visual features are extracted from the fully connected layer and reduced to 512 dimensions using hyperbolic tangent functions with a weight parameter, which yields the visual embeddings. Positional encoding vectors and input embeddings of the video features are passed to the multi-head self-attention encoder transformer, which consists of N stacked conventional layers. The extracted features are projected to latent dimension d with a weight and a bias. A convolutional stacked attention-based transformer (CSAT) is used to project the moment sequence representations and the query representations into a joint embedding space. CSAT combines convolutional layers, which extract features from input sequences, with stacked attention mechanisms that emphasize important segments. Transformers are used to capture relationships between input parts, aiding tasks such as sequence understanding and prediction. Thus, an interactive multi-head self-attention transformer is employed to explore the relationship of the query image regions in the video for fine-grained retrieval.
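The reduction from 4096-d to 512-d features and the addition of positional information can be sketched as below. This is a minimal illustration under stated assumptions: the projection is modelled as a single linear map followed by tanh, and fixed sinusoidal encodings are assumed because the form of the positional encoding is not specified here. All function and variable names are hypothetical.

```python
import numpy as np

def project_features(features, weight, bias):
    """Map (T, 4096) clip features to (T, d) embeddings with a tanh non-linearity."""
    return np.tanh(features @ weight + bias)

def sinusoidal_positions(length, dim):
    """Fixed sinusoidal positional encodings of shape (length, dim); an assumed choice."""
    pos = np.arange(length)[:, None]
    idx = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    # Even channels use sine, odd channels use cosine.
    return np.where(idx % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(0)
clip_features = rng.standard_normal((8, 4096))     # 8 clips of 4096-d features
weight = rng.standard_normal((4096, 512)) * 0.01   # hypothetical learnable weight
bias = np.zeros(512)                               # hypothetical learnable bias
embeddings = project_features(clip_features, weight, bias)
encoder_input = embeddings + sinusoidal_positions(8, 512)
```

Because tanh bounds each embedding value to the interval [-1, 1], the positional term and the content term stay on a comparable scale before entering the encoder.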
The goal of the IMSA localizer is to predict the boundaries of unseen activities in the video containing contender-relevant temporal moments during inference. Multiple objects of the query are tracked in the video based on the object of interest. Each layer has an encoder followed by layer normalization and a feed-forward network with normalization. A stack of three layers is used in the transformer block. The learning target for the queries is to explicitly differentiate intra-video moments from inter-video global semantics. The algorithm draws a box around the object and uses it to determine the object's kind and location. It gives each object its own identifier (ID) and continuously follows the detected object as it moves between frames, saving any pertinent data. Additive margin-based triplet loss and maximum margin-based triplet loss are used to train the model in two different configurations. Maximum margin-based triplet loss maximizes the margin between anchor-positive and anchor-negative pairs, promoting distinct embeddings for different classes. Additive margin-based triplet loss adjusts the margin dynamically based on similarity, potentially improving generalization. Algorithm 1 illustrates the steps involved in multi-attention for temporal localization. During the inference stage, the localization with the most similar representation is retrieved from the video sequences for the given query.
Contrastive learning for Spatio-temporal Localization
Contrastive learning is used to learn the visual representations and features extracted from the encoders. It is used to maximize the mutual information among input data and visual representations. A small feed-forward neural network is used to map the encoders’ output representations to 128 dimensions of latent space. Contrastive loss and cross-entropy loss are calculated to determine the boundaries of the target moment in a video.
Batches of a specific size, N, are created from the raw image and video to begin contrastive learning at the frame level. A stochastic IMSA transformation function is applied to each image in the batch to create a pair of two images. To obtain visual features, each augmented image in a pair is sent through an encoder. The representations of the two augmented images are then sent through a non-linear dense layer, a ReLU, and another dense layer. Feature extraction and transformation play a crucial role in enhancing a model's capability to recognize complex patterns and relationships within data. This process significantly boosts performance in tasks such as image classification, object detection, and image segmentation. After passing through this sequence of layers, which applies non-linear encoding, the images are projected into a representation.
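The dense, ReLU, dense projection described above can be sketched in a few lines. The 128-d output follows the latent size mentioned earlier in this section; the 512-d input, the 256-d hidden width, and the random weights are assumptions made only so that the example runs.

```python
import numpy as np

def projection_head(h, w1, b1, w2, b2):
    """Dense -> ReLU -> dense projection of encoder outputs into the latent space."""
    hidden = np.maximum(0.0, h @ w1 + b1)   # non-linear dense layer with ReLU
    return hidden @ w2 + b2                 # final dense layer, no activation

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((4, 512))   # 4 augmented images, assumed 512-d features
w1, b1 = rng.standard_normal((512, 256)) * 0.05, np.zeros(256)   # hypothetical hidden layer
w2, b2 = rng.standard_normal((256, 128)) * 0.05, np.zeros(128)   # 128-d latent output
z = projection_head(encoder_out, w1, b1, w2, b2)
```

The contrastive loss is then computed on `z` rather than on the raw encoder outputs.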
An embedding vector is obtained for every augmented image in the batch. Cosine similarity is used to determine the similarity between two augmented representations of an image. In augmented image representations, features are typically represented as vectors, and comparing these vectors using cosine similarity measures their directional alignment. Higher cosine values imply greater feature similarity, which is useful for tasks such as image retrieval and classification. The contrastive loss is then computed using the loss function. For each augmented pair of visual words of query and frame, the SoftMax function is applied to estimate the probability based on similarity.
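One common way to combine cosine similarity with SoftMax in a contrastive loss is the normalized temperature-scaled cross-entropy (NT-Xent) used in SimCLR-style pipelines. The sketch below swaps in that formulation as an illustration; it is not confirmed to be the exact loss of the proposed model, and the temperature value is an assumption.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so that dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for two batches of embeddings; row i of z_a pairs with row i of z_b."""
    n = z_a.shape[0]
    z = l2_normalize(np.concatenate([z_a, z_b], axis=0))
    sim = z @ z.T / temperature                    # pairwise cosine similarity / temperature
    np.fill_diagonal(sim, -np.inf)                 # a sample is never its own candidate
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim = sim - sim.max(axis=1, keepdims=True)     # stabilise the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

aligned = nt_xent_loss(np.eye(4), np.eye(4))                        # views agree
mismatched = nt_xent_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0)) # views disagree
```

The loss is smaller when the two views of each sample agree (`aligned`) than when the pairs are shuffled (`mismatched`), which is the behaviour that pulls similar visuals together in the embedding space.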
The encoder and projection head approximations gradually improve as a result of the loss, and the representations acquired bring similar visuals closer together in space. Contrastive learning outperformed earlier self-supervised approaches on ImageNet, according to the results.
Moment localization with contrastive learning
The encoded features of the given query
The clip is defined by a bounding window where the start and end are frame numbers. The similarity score between the given image query
Where
The starting score,
As layer normalization is performed, the feature vectors are being gaussian distributed. The probability distribution of the boundary with starting
The target moment boundaries are derived by maximizing the combined probabilities. Thus, the best boundaries are detected with the maximized scores.
Where,
The query is encoded into vectors
The similarity between the query and the video sequence elements is estimated using cosine similarity on the extracted features. The features include text features such as TF-IDF vectors or embeddings from the query, visual features from video frames such as deep learning-based embeddings, and temporal features such as LSTM or Transformer-based embeddings that capture temporal dynamics in the sequence. Cosine similarity measures the cosine of the angle between two projected vectors: the smaller the angle between the vectors, the higher the similarity. It is calculated as the dot product of the two vectors divided by the product of their magnitudes. Then cross-correlation is computed.
Here norm (.) is the L2 normalization. The similarity score is a scalar value. The maximum similarity score is considered for representing the highest match amongst query and video.
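The matching step described above, L2 normalization followed by a dot product and selection of the maximum score, can be sketched as follows. The function names are illustrative, and the features are assumed to be already extracted as one query vector and one vector per clip.

```python
import numpy as np

def cosine_scores(query, clips, eps=1e-12):
    """Cosine similarity between one query vector and each row of `clips`."""
    query = np.asarray(query, dtype=float)
    clips = np.asarray(clips, dtype=float)
    q = query / max(np.linalg.norm(query), eps)                              # L2-normalize query
    c = clips / np.maximum(np.linalg.norm(clips, axis=1, keepdims=True), eps)  # L2-normalize clips
    return c @ q                                                             # one scalar per clip

def best_matching_clip(query, clips):
    """Return the index and score of the clip with the maximum similarity."""
    scores = cosine_scores(query, clips)
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

idx, score = best_matching_clip([1.0, 0.0], [[0.0, 2.0], [3.0, 0.0], [-1.0, 0.0]])
```

In this toy example the second clip points in the same direction as the query, so it is selected regardless of its larger magnitude.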
During the training of the video retrieval objective, a triplet margin loss function is adopted. The accurate identification of boundaries is achieved by minimizing loss functions, such as cross-entropy or mean squared error, which measure the deviation from ground truth labels. Through iterative optimization, the model learns to pinpoint the precise boundaries of the target instant. This function is highly used for content-based retrieval tasks. This
Where
The IMSA model is trained on similarity relationships through the contrastive loss, with which the embeddings are learned. This loss function computes the similarity among objects using similarity functions and is used to optimize the nonlinear parameters. The two losses are combined and minimized so that the model learns to approximate the boundaries in the embedding space.
The time complexity and space complexity of training the IMSA model are shown in Table 2. Here, batch size (S), input size (N), context size (L), channel dimension (D), latent channel dimension, and number of queries (Q) are the quantities used to train an IMSA network.
Table 2. Complexity measure.
The video retrieval function anticipates that the video contains a moment similar to the query; hence, frame-wise contrastive learning emphasizes moment specificity inside a particular video-image query pair. Video retrieval functions are crucial for quickly finding and accessing specific video content. They enable users to search within videos based on keywords, topics, or even visual elements, enhancing the efficiency of video content management and enabling more precise content discovery. The visual features that lie within the target moment's boundary are considered foreground or positive samples, while the others are considered background or negative samples. Class imbalance in datasets can be addressed with strategies such as oversampling the minority class, augmenting its samples, using robust algorithms like Random Forest or AdaBoost, adjusting misclassification costs, and employing techniques like SMOTE to generate synthetic samples. The contrastive loss is then determined by calculating the mutual information between the query and the positive video features. Contrastive loss distinguishes images based on their similarity. A similarity score is calculated by comparing the feature or latent layer to the target using a similarity metric and training with the target. When both inputs are the same (positive pairs), the target is 0. Contrastive learning is combined with Gaussian Mixture Model-Expectation Maximization (GMM-EM) to maximize the likelihood for localization in the video grounding process.
GMM parameters estimation
For recognizing moving objects, the Gaussian Mixture Model (GMM) is a well-known background modelling technique. GMM is usually used to cluster data represented as a posterior probability distribution. In this method, all data points are assumed to be generated from a mixture of bounded Gaussian distributions with unknown parameters. When dealing with unknown parameters, such as mean and variance, common methods include maximum likelihood estimation and Bayesian inference. These approaches account for constraints and provide estimates that align with the data, making bounded Gaussians a good choice for modelling such data points. The GMM technique creates a model for each pixel using several flexible Gaussian distributions and updates the model using an online approximation. The threshold (T) and learning rate (
You et al. [26] introduced a hard GMM into the self-attention transformer to improve the quality of cross attention, but this hard-coded attention performed worse than the base transformer. Zhou et al. [31] captured the correlation among features using the likelihood of the expectation-maximization algorithm with GMM parameters, so that features are discriminated for the recognition task. Gaussian kernels control the sequential range of the mapped features.
GMM squeezes the probable outcomes into smaller regions. It is not limited to discrete data but also applies to continuous data. A GMM is a weighted sum of K multivariate Gaussian distributions. It is expressed as:
Where
The parameters used in the algorithms are shown in Table 3.
Table 3. Parameters used.
Likelihood regularization is related to the probability distribution. Segment features and embeddings are used to predict the combined segment-embedding relevance likelihood. Neighbourhood information is incorporated to smooth the predicted probabilities. This helps to perform intercession in video grounding tasks with highly accurate inference of temporal boundaries. This is known as regression loss, which fine-tunes the distance between the predicted moment chunks and the ground truth ones that are shifted from their boundaries. The contender generator eliminates the video chunks that are not relevant to the class. It also reduces the search space using a high-recall transformer model. This improves the performance of the likelihood computation and gives high precision for the top K results.
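The mixture density used in this section, a weighted sum of K multivariate Gaussians, and the responsibilities computed in the E-step of GMM-EM can be sketched as below. The notation (mixing weights, means, covariances) is the textbook one and is an assumption here, since the symbols of the original equation are not available; the example values are purely illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x, shape (n, d)."""
    d = mean.shape[0]
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T              # cov^{-1} (x - mean) per row
    exponent = -0.5 * np.sum(diff * solved, axis=1)      # squared Mahalanobis distance / -2
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(exponent) / norm

def gmm_density(x, weights, means, covs):
    """Weighted sum of K Gaussian components; also returns the per-component terms."""
    comps = np.stack(
        [w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs)], axis=1
    )
    return comps.sum(axis=1), comps

def responsibilities(x, weights, means, covs):
    """E-step of EM: posterior probability of each component for each point."""
    total, comps = gmm_density(x, weights, means, covs)
    return comps / total[:, None]

# Two identical 1-D standard normal components with equal weights (toy values).
x = np.array([[0.0]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0], [0.0]])
covs = np.array([[[1.0]], [[1.0]]])
density, _ = gmm_density(x, weights, means, covs)
resp = responsibilities(x, weights, means, covs)
```

In the M-step, these responsibilities would be used to re-estimate the weights, means, and covariances, and the two steps would alternate until the likelihood stops improving.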
Dataset
Experimental analysis is conducted on three benchmark datasets, SumMe, ActivityNet Captions, and Charades-STA, which are designed for localizing moments in content-based video retrieval. Jie et al. (2020) discussed a large-scale dataset for moment retrieval from videos. ActivityNet Captions is a dataset with temporal annotations [19] that is used for object tracking. Each video in this dataset consists of many segments with labels characterizing the actions. The Charades-STA dataset is often used for video captioning in the retrieval of video moments. The details of the datasets are given in Table 4.
Table 4. Dataset description.
Our model is implemented on Ubuntu 16.04 with YOLOv3 and TensorFlow 1.8.0, using CUDA 11.0 and cuDNN 8.0.5. All experiments are conducted on a workstation with dual NVIDIA GeForce GTX 3070 GPUs. The network is trained for 50 epochs with a minimum batch size of 30 and a learning rate of 0.001. Moment and weighting factors are introduced, and the learning rate is reduced after every epoch. For the input query image, 4096-d features of several regions are extracted, together with their position vector embeddings, using a C3D CNN. A 3D convolutional network is used to locate the boundaries of the actions recognized within the video. The dimension of its global features is reduced to 512 without any redundancy. Similarly, 4096-d visual features are extracted from the video frames using a C3D CNN for each unit of 16 frames. Feature mapping is performed with kernels of sizes 3 and 5 in the convolution layers. On passing these features to the conventional stacked IMSA attention model, a 256-d feature is projected. Layer normalization is added with the activation function. In contrastive learning, horizontal flipping and random cropping are used for data augmentation of the query image. Gaussian-distributed data with a probability of 0.5 is used for training. To reduce the combined loss, a stochastic gradient descent (SGD) optimizer is used. The proposed model is trained until the loss plateaus. Figure 3 shows the moments localized in the videos for the image queries given to the IMSA network. After empirical analysis with various parameters, the best results are achieved with the settings mentioned above.
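The two augmentations named above, random cropping and horizontal flipping, can be sketched with plain array slicing. The crop size and the flip probability of 0.5 are assumptions for illustration; the text does not specify them for the augmentation step.

```python
import numpy as np

def augment(image, crop_size, rng, flip_prob=0.5):
    """Random crop followed by an optional horizontal flip of an (H, W, C) image."""
    h, w = image.shape[:2]
    ch, cw = crop_size
    top = rng.integers(0, h - ch + 1)     # random top-left corner of the crop
    left = rng.integers(0, w - cw + 1)
    out = image[top:top + ch, left:left + cw]
    if rng.random() < flip_prob:          # flip left-right with the given probability
        out = out[:, ::-1]
    return out

rng = np.random.default_rng(0)
image = np.arange(6 * 8 * 3).reshape(6, 8, 3)   # toy 6x8 RGB image
view_a = augment(image, (4, 4), rng)            # two stochastic views of the same image
view_b = augment(image, (4, 4), rng)
```

Calling `augment` twice on the same query image yields the positive pair used for contrastive training.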

Figure 3. Moments localized in the video at right for the given image queries at left: (a) Pouring water (b) Girls playing tennis (c) Air force videos (d) Cooking pasta (e) Sea riding.
Video moment retrieval is measured by the Recall metric (R). The performance of the proposed model is evaluated on two criteria, mIoU (mean IoU) and IoU@R. The Intersection over Union (IoU) is estimated for all pairs of predicted video segments and ground-truth segments. IoU represents the overlap between the predicted bounding box and the actual bounding box. If the IoU is high, the predicted bounding box coordinates are very similar to the ground-truth box coordinates. The mean of the IoU over all predictions is then computed. Figure 4 shows the Recall-IoU curve for the proposed IMSA model compared with the GTAN and SCDM approaches, validated on the ActivityNet Captions dataset. To discriminate moments from distinct videos, the proposed learning target uses global semantics. To provide this, the network is trained to align corresponding video-query pairs in the subspace, where the video-query relevance R is estimated from the moment similarity. Figure 5 illustrates the Recall-IoU curve of the proposed IMSA model validated on the Charades-STA dataset. As a result, the video-query relevance score is used to evaluate and report the model’s effectiveness in identifying or retrieving the proper video for a given image query.
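The IoU and mIoU computations described above can be sketched for temporal segments as follows. This is a minimal illustration, assuming each moment is a one-dimensional (start, end) interval in seconds rather than a spatial bounding box; the names `temporal_iou` and `mean_iou` are chosen for this example and are not taken from the implementation.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over Union of two temporal segments."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    # Two zero-length segments have an empty union; treat their IoU as 0.
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Mean IoU over paired predictions and ground-truth segments."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Overlap of 5 s over a union of 15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(mean_iou([(0.0, 10.0), (0.0, 10.0)], [(0.0, 10.0), (5.0, 15.0)]))
```

A prediction that exactly matches the ground truth scores 1, and disjoint segments score 0, so mIoU summarizes how tightly the predicted boundaries follow the annotations on average.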

Recall-IoU curve on ActivityNet Captions.

Recall-IoU curve on Charades-STA.
The evaluation criterion used to measure the performance of the video retrieval task is R@K on three datasets: ActivityNet Captions, Charades-STA, and SumMe. The R@K value indicates the percentage of input queries for which a correctly predicted moment is found among the top K retrieved videos. IoU is calculated between the predicted video chunks and the ground-truth chunks for the query image. The percentage of predicted moments for which the IoU is greater than R is computed and denoted as IoU@R. In previous methods, IoU scores are estimated for the top-K retrieved video clips, and sliding windows are aligned based on the confidence scores of the returned moments. Here, using the IMSA model together with the contrastive loss and GMM loss hyperparameters, the model searches the video until the best-aligned moment is found in the clip and returns it. The IoU scores are calculated for R@1 and R@5 on two datasets and are shown in Tables 5 and 6. The IoU is calculated between the clip of the state at time t, [
Performance comparison on ActivityNet Captions.
Performance comparison on Charades-STA.
The variants of our system are tested for IoU
The proposed IMSA transformer model achieves an average recall over different IoU thresholds of 86.45% and 80.23% on ActivityNet and Charades-STA, respectively. The mean average precision (mAP) is determined by first computing the Average Precision (AP) for each class and then taking the mean of these values across all classes. The mAP takes into account the cost of false positives (FP) and false negatives (FN) and reflects the trade-off between precision and recall. The mAP can be used to evaluate the precision of an object recognition and tracking model. When tracking objects in videos processed at 30 FPS, the model attains a mAP of 59.3% on the ActivityNet Captions dataset. Figure 6 illustrates the average recall versus average number of proposals curve obtained by the proposed IMSA model. The computational cost is reduced by using multiple queries, and the performance measures, comprising the number of parameters, FLOPs, and top-1 and top-5 accuracy, are shown in Table 7.
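The recall and mAP measures discussed in this section can be sketched as below. This is a hedged illustration rather than the evaluation code used in the experiments: moments are assumed to be (start, end) intervals, a query counts as a hit when any of its top-K ranked predictions reaches the IoU threshold (an inclusive comparison is assumed here), and all function names and sample values are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over Union of two temporal segments."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(ranked_preds: Sequence[Sequence[Segment]],
                gts: Sequence[Segment], k: int, iou_threshold: float) -> float:
    """Fraction of queries with a top-k prediction whose IoU meets the threshold.

    ranked_preds[i] holds the predictions for query i, best first.
    """
    if len(ranked_preds) != len(gts) or not gts:
        raise ValueError("need one non-empty ranked list per ground truth")
    hits = sum(
        any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)


def average_recall(ranked_preds: Sequence[Sequence[Segment]],
                   gts: Sequence[Segment], k: int,
                   iou_thresholds: Sequence[float]) -> float:
    """Mean of recall_at_k over several IoU thresholds."""
    recalls = [recall_at_k(ranked_preds, gts, k, t) for t in iou_thresholds]
    return sum(recalls) / len(recalls)


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """Mean of per-class Average Precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


if __name__ == "__main__":
    preds = [[(0.0, 10.0), (20.0, 30.0)], [(0.0, 4.0), (50.0, 60.0)]]
    gts = [(18.0, 30.0), (52.0, 60.0)]
    print(recall_at_k(preds, gts, k=1, iou_threshold=0.5))
    print(recall_at_k(preds, gts, k=2, iou_threshold=0.5))
    print(average_recall(preds, gts, k=2, iou_thresholds=[0.5, 0.82]))
```

In the sample data the top-ranked prediction misses for both queries while the second-ranked one overlaps well, which shows why recall at K = 5 is typically higher than at K = 1 for the same IoU threshold.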
Performance measures.

Average recall-average proposals curve.
The results reveal that temporal moment localization by IMSA is efficient compared to other approaches. Processing dark or bright frames in the video is challenging for the proposed model, as the misclassification rate is higher for such frames. Thus, while a set of features may effectively retrieve frames across a range of illumination variations, it may struggle with noisy or irrelevant information. This underscores the challenge of dealing with noisy data in video moment localization tasks. The model extracts only video frames and fails to detect semantics in noisy frames. To overcome such limitations, fusion-based feature extraction can be used in future work.
Identifying moments in long, untrimmed videos using machine translation queries is a beneficial and demanding endeavour in fine-grained video retrieval. This work investigates locating moments in a video sequence using image queries and proposes a model for temporal localization. Two novel algorithms, IMSA and contrastive learning with GMM, are proposed to determine the alignment and locate the temporal bounds. The ActivityNet and Charades-STA benchmark datasets are given as input to the proposed model, and the results are evaluated based on precision and recall under the IoU curve. The model achieves an average recall of 86.45% on ActivityNet and 80.23% on Charades-STA, surpassing convolutional networks in frame extraction for efficient video retrieval. Future research directions include exploring real-world applications such as video surveillance, content recommendation, and summarization, alongside enhancing the model’s robustness in handling diverse frame conditions and improving semantic information detection through fusion-based feature extraction methods.
