Decoupling Hierarchical Errors: A Dual-Pronged Approach for In-Group and Out-of-Group Challenges in FGVC

Abstract

Fine-grained visual classification (FGVC) is challenged by subtle interclass differences, which lead to both out-of-group and in-group errors, particularly when label hierarchies are only partially available. To leverage hierarchical labels for enhanced feature discriminability, this paper introduces two novel loss functions. First, a hierarchically discriminative loss ($L_{hd}$) linearly combines fine-grained subcategory features with their ancestral superclass features. This fusion, enhanced by cross-channel max pooling and channel-wise dropout, strengthens feature discriminability to suppress out-of-group errors. Second, an in-group regularization loss ($L_{ig}$) addresses visually similar categories within the same superclass. It introduces controlled interference by mixing target class features with randomly selected features from other in-group classes. This process increases learning difficulty without altering labels, thereby mitigating sample-specific overfitting and reducing in-group errors. The proposed loss functions are easy to integrate into existing frameworks, require no extra annotations or complex network modifications, and support end-to-end training. Extensive experiments on five benchmark datasets (BreakHis, CUB-200-2011, Butterfly-200, FGVC-Aircraft, and Stanford Cars) demonstrate that our method consistently improves top-1 accuracy, weighted average precision, mean average precision, and $F_1$-score. The performance gains are particularly significant in scenarios with low fine-grained annotation ratios.

Keywords

fine-grained image classification; hierarchical labels; hierarchically discriminative loss; in-group regularization; deep learning

1. Introduction

Fine-grained visual classification (FGVC) aims to distinguish between subordinate categories within a domain, a task that is significantly more challenging than general object recognition due to subtle interclass variations and large intraclass variations. While a general object recognition system might be tasked with distinguishing a “car” from a “bicycle,” an FGVC system must differentiate between a “2012 Honda Civic” and a “2012 Toyota Camry,” which share a vast number of visual features. This demand for high precision has driven the adoption of FGVC in a wide range of applications where nuanced understanding is critical. These domains include automated biodiversity monitoring (Wah et al., 2011), where it aids in tracking and conserving species; fashion analysis (Chang et al., 2017), for powering recommendation engines and trend analysis; facial expression analysis (Lo et al., 2023), for more sophisticated human–computer interaction; intelligent transportation (Chen et al., 2021), for autonomous driving and traffic management; and medical analysis (Han et al., 2017), where it assists in the diagnosis and grading of diseases from medical imagery.

The precision of FGVC enables a deeper understanding and more detailed categorization of objects, which is crucial in fields where detailed information is vital for making important decisions. For example, in medical applications, the accurate and timely diagnosis of breast cancer is paramount. A fine-grained classification of breast cancer is more meaningful than a simple benign/malignant classification, as identifying subtypes of malignant tumors can help doctors select more appropriate treatment options and formulate targeted therapeutic plans. However, achieving this level of precision is a significant challenge, as it requires a model to learn highly discriminative features that can capture the subtle, localized differences that define subordinate classes. This difficulty has spurred the development of various advanced FGVC methods (Chang et al., 2020; Chen et al., 2022; Hu et al., 2025; Liu et al., 2022).

Many fine-grained classification datasets contain hierarchical labels from coarse to fine (Chen et al., 2018; Wah et al., 2011) (e.g., breast cancer-malignant-ductal carcinoma), which can be represented by a tree-like hierarchy. This structure is not merely an organizational convenience; it encodes crucial semantic relationships between categories. Based on this hierarchy, two primary semantic relationships exist between fine-grained categories: in-group and out-of-group. An “in-group” relationship exists between two fine-grained categories if they belong to the same coarse-grained category; otherwise, an “out-of-group” relationship exists. Figure 1 provides a schematic illustration of these relationships defined by the label hierarchy.

Figure 1.

Each object can be annotated at multiple levels in the label hierarchy. The coarse-to-fine labels can be used to define two semantic relationships between the same-level labels: out-of-group and in-group.

Figure 2(a) illustrates this relationship with image examples: the “Red-winged Blackbird” in the first column and the “Rusty Blackbird” in the second column have an in-group relationship, as both are types of blackbirds. In contrast, the “Blue-throated Goldentail” in the third column has an out-of-group relationship with the first two, as they belong to different coarse-grained superclasses (blackbirds vs. hummingbirds). In the BreakHis dataset shown in Figure 2(b), both adenosis and fibroadenoma are benign tumors, whereas ductal carcinoma is a malignant tumor. It is evident that subclasses under the same superclass are often visually very similar, making them difficult to distinguish. This high visual similarity among in-group categories is a primary source of classification errors, while distinguishing between out-of-group categories, though generally easier, still requires a model to learn robust parent-class features.

Figure 2.

Images selected from (a) the CUB-200-2011 dataset and (b) the BreakHis dataset, respectively.

This study aims to leverage hierarchical labels to learn more discriminative feature embeddings to reduce prediction errors. To this end, we propose two novel loss functions that target out-of-group and in-group prediction errors, respectively, by explicitly modeling the hierarchical relationships in the feature space.

First, to address out-of-group errors, we propose a hierarchically discriminative loss. This loss promotes the inheritance of parent superclass attributes in fine-grained categories by linearly combining their features with those of their coarse-grained ancestors. This feature fusion helps constrain fine-grained predictions within the correct group and effectively distinguishes them from categories with an out-of-group relationship, thereby reducing out-of-group errors.

Second, for in-group errors, we propose an in-group regularization loss. Since in-group categories have only subtle differences in appearance, this loss treats other in-group categories as distractors. By linearly combining the target category’s features with randomly selected in-group category features, it increases the difficulty of extracting discriminative features. This regularization strategy helps the network discover more unique features and reduces in-group misclassifications.

By effectively reducing misclassifications and enhancing the identification of subtle differences, our proposed loss functions are particularly valuable for improving the reliability of applications such as medical image analysis.

The main contributions of this paper can be summarized as follows:

We propose a hierarchically discriminative loss, $L_{hd}$, which utilizes the coarse-to-fine label hierarchy to enhance the discriminability between the target category and out-of-group categories through feature fusion.

We propose an in-group regularization loss, $L_{ig}$, which enhances the discriminative ability for in-group categories by regularizing the target category with noisy features.

Extensive experiments on a medical dataset and several FGVC benchmark datasets demonstrate that the proposed loss functions can effectively improve the performance of existing multigranularity methods, especially in scenarios with limited fine-grained annotations.

The remainder of this paper is organized as follows. Section 2 reviews related work in FGVC and hierarchical classification. Section 3 details our proposed methodology, including the architecture and the formulation of the two novel loss functions. Section 4 presents the experimental setup, datasets, evaluation metrics, and a comprehensive analysis of the results, including ablation studies and comparisons with state-of-the-art (SOTA) methods. Finally, Section 5 discusses the limitations and future directions, and Section 6 concludes the paper.

2. Related Work

2.1. Fine-Grained Visual Classification

FGVC is a specialized form of image classification that distinguishes between closely related categories by identifying subtle interclass differences. Traditional fine-grained problems, such as distinguishing between the species of birds or flowers, have attracted considerable research attention. Early works (Wei et al., 2018; Zhang et al., 2014) often relied on a localization-classification pipeline, attempting to address these problems using extra annotations, such as object bounding boxes or part-level labels. For example, Part Region-based Convolutional Neural Network (Zhang et al., 2014) learned detectors and part models, imposing geometric constraints between parts and between parts and the object. Subsequently, a fine-grained category was predicted from a pose-normalized representation.

However, acquiring supervisory information for parts or bounding boxes is time-consuming and expensive. To overcome this issue, subsequent methods have focused on weakly supervised approaches that can localize discriminative parts using only image-level labels (Chang et al., 2020; Yang et al., 2018; Zheng et al., 2017). These approaches often leverage attention mechanisms to guide feature extraction, focusing on subtle yet informative parts of the image. For instance, Zheng et al. (2017) proposed the Multiattention Convolutional Neural Network, which can simultaneously learn discriminative parts and fine-grained feature representations across all feature channels. Another popular line of work involves orderless feature encoding, such as Bilinear Convolutional Neural Networks (CNNs) (Liu et al., 2020), which capture second-order statistics of feature interactions to create highly discriminative representations.

Despite their success in localizing key regions, a common limitation of these traditional FGVC methods is that they treat all categories as a flat list. They typically do not explicitly model the semantic relationships that exist between categories, such as the hierarchical structure. Chang et al. (2020) introduced Mutual-Channel Loss (MC-Loss), which consists of a discriminative loss and a diversity loss. The discriminative loss ensures that each feature channel is discriminative for a specific category, while the diversity loss encourages feature channels to be mutually exclusive in the spatial dimension. However, by ignoring the label hierarchy, MC-Loss and similar methods are limited in their ability to transfer knowledge from coarse to fine levels, a gap that our proposed method aims to fill.

2.2. Hierarchical FGVC

Recent research has focused on applying multigranularity labels to FGVC (Chang et al., 2021; Chen et al., 2018, 2022). Chen et al. (2018) proposed the hierarchical semantic embedding framework, which predicts category score vectors in a coarse-to-fine hierarchical order. The predicted score vectors at each coarse-grained level are used as prior knowledge to learn more fine-grained feature representations. Chang et al. (2021) divided the classification head into hierarchy-specific heads and fused fine-grained features into coarse-grained features for coarse-level classification.

To further explore the semantic relationships within the label hierarchy, Chen et al. (2022) proposed a combined loss function that integrates a tree-like hierarchical probability classification loss, encoding parent–child relationships, with a multiclass entropy loss for the leaf categories. Liu et al. (2022) designed a cross-hierarchy orthogonal fusion module to simulate the human attention process, which shifts from coarse to fine details.

2.3. Transformer-Based FGVC

In recent years, the Transformer architecture, which has achieved tremendous success in natural language processing, has been adapted for computer vision tasks. The introduction of the Vision Transformer (ViT) (Han et al., 2022) marked a significant shift, demonstrating that a pure Transformer-based model can achieve SOTA performance on image classification benchmarks. Unlike traditional CNNs that process images through convolutional filters, ViT treats an image as a sequence of flattened patches, using self-attention mechanisms to capture global dependencies between them.

This new paradigm has been quickly adopted and extended for FGVC, leading to a new class of powerful models. For instance, TransFG (He et al., 2022) progressively selects discriminative image patches at different layers using an attention-guided mechanism, effectively filtering out background noise. RAMS-Trans (Hu et al., 2021) introduces a recurrent attention multiscale Transformer to iteratively locate and magnify subtle regions. Feature Fusion Vision Transformer (FFVT) (Wang et al., 2021) focuses on fusing features from different Transformer blocks to obtain a more comprehensive representation. Dual Cross-Attention Learning (DCAL) (Zhu et al., 2022) enhances feature discrimination by modeling interpart and interimage relationships. These Transformer-based approaches have shown great promise, often outperforming their CNN-based counterparts by leveraging the ability of self-attention to model long-range dependencies and focus on subtle, discriminative regions without explicit part annotations.

2.4. Pathological Image Classification

Early methods for pathological image classification (Doyle et al., 2008) primarily relied on hand-crafted features, such as morphological, texture, and nuclear features. With the excellent performance of deep neural networks in computer vision tasks, recent studies (Han et al., 2017; Hou, 2020; Li et al., 2023) have increasingly adopted deep learning for pathological image classification, achieving remarkable results. Furthermore, some models (Tian et al., 2023) have borrowed ideas from fine-grained classification on natural image datasets to tackle pathological image classification tasks. Beyond pathology, deep learning architectures like ResNet have also demonstrated robust performance in broader health classification tasks, such as fetal health monitoring via sensor fusion (Selvan et al., 2025), highlighting the versatility of these backbones as benchmarks for medical applications.

3. Methods

3.1. Overview

Our network architecture, illustrated in Figure 3, is designed to explicitly leverage label hierarchies for improved fine-grained classification. It integrates our two proposed loss functions, the hierarchically discriminative loss ($L_{hd}$) and the in-group regularization loss ($L_{ig}$), into a multilevel feature extraction framework. The architecture consists of four main components: a shared backbone network, a set of hierarchy-specific convolutional blocks, corresponding fully connected (FC) layers, and classification heads.

Given an input image, a foundational feature map $F \in \mathbb{R}^{C \times H \times W}$ is first extracted using a standard backbone network (e.g., ResNet-50; He et al., 2016), where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively. This backbone serves as a powerful general-purpose feature extractor. The feature map $F$ is then passed to three parallel convolutional blocks. Each block is structurally identical but is trained to specialize in extracting features for a specific level of the label hierarchy. These multilevel features, denoted from coarse to fine as $F^{1}$, $F^{2}$, and $F^{3}$, maintain the same dimensions as $F$. This multibranch design allows the network to learn representations at different semantic granularities simultaneously. The specialized features are then used for two purposes: they are passed through level-specific FC layers and classification heads to calculate a standard cross-entropy loss ($L_{ce}$) for each level, and they serve as the inputs for our proposed hierarchical loss functions, $L_{hd}$ and $L_{ig}$, which enforce hierarchical consistency and robustness.

3.2. Hierarchically Discriminative Loss

As shown in Figure 3, each image is annotated with a three-level label chain, $y_{1}, y_{2}, y_{3}$, representing coarse-to-fine granularities. Let $c_{k}$ be the number of categories at level $k$, with $y_{3}$ and $c_{3}$ denoting the ground-truth label and category count at the finest level.

The backbone feature map $F$ is processed by three convolutional blocks to extract multigranularity feature maps $F^{k} \in \mathbb{R}^{C \times H \times W}$ for $k = 1, 2, 3$. For each level $k$, the feature map $F^{k}$ is composed of class-aligned feature groups $F_{i}^{k} \in \mathbb{R}^{n_{k} \times H \times W}$, one for each of the $c_{k}$ classes, where $n_{k}$ is the number of channels dedicated to representing each class.

Figure 3.

The framework of the multigranularity fine-grained classification network using $L_{hd}$ and $L_{ig}$. We illustrate the network architecture, which includes three hierarchical levels. The convolutional block at each level generates features of a specific granularity, which are then used as input for $L_{hd}$ and $L_{ig}$.

To leverage the semantic hierarchy, we propose a hierarchically discriminative loss that encourages subclass features to inherit the properties of their parent classes. This helps to distinguish them from out-of-group categories. We first compute a hierarchically fused feature, denoted as $S_{i}^{k}$ for class $i$ at level $k$, as follows:

$$S_{i}^{k} = \frac{1}{k} \sum_{r = 1}^{k} \underbrace{\max_{j = 1, 2, \dots, n_{k}}}_{\text{CCMP}} \underbrace{\left[ M_{i}^{r} \cdot F_{i, j}^{r} \right]}_{\text{CWD}} \tag{1}$$

where CCMP and CWD represent cross-channel max pooling and channel-wise dropout, respectively.

CWD is designed to ensure that the network captures discriminative information from all $n_{k}$ channels associated with a specific category. To prevent the model from relying on a small subset of highly correlated channels, CWD adaptively masks the most correlated ones.

Specifically, to compute the correlation matrix $M_{c} \in \mathbb{R}^{n_{k} \times n_{k}}$ for the feature map $F_{i}^{r}$, we first apply $3 \times 3$ average pooling to smooth the features. Then, we use the $\arg\max(\cdot)$ operation to find the coordinates of the peak response for each channel, generating a position matrix $P \in \mathbb{R}^{n_{k} \times 2}$:

$$P = \left[ (t_{x}^{1}, t_{y}^{1}), (t_{x}^{2}, t_{y}^{2}), \dots, (t_{x}^{n_{k}}, t_{y}^{n_{k}}) \right] \tag{2}$$

where $(t_{x}^{j}, t_{y}^{j})$ are the peak response coordinates of the $j$-th channel. Next, we compute the Euclidean distance between peak positions to generate the correlation matrix $M_{c}$:

$$M_{c}(i, j) = \left\| (t_{x}^{i}, t_{y}^{i}) - (t_{x}^{j}, t_{y}^{j}) \right\|_{2} \tag{3}$$

A dropout mask $M_{i}^{r}$ is then generated by setting the top $\gamma$ fraction of elements with the highest correlation to 0 and the rest to 1. This mask is applied to the feature map $F_{i, j}^{r}$ via broadcast multiplication. Note that this masking is active only during training and is normalized:

$$F_{i, j}^{\prime r} = M_{i}^{r} \times F_{i, j}^{r} \tag{4}$$

By masking spatially correlated feature channels, CWD forces the model to discover and learn other, more diverse discriminative patterns. While the concept of channel-wise dropout has been explored in various forms (Hou et al., 2019; Tompson et al., 2015), our specific implementation of CWD is tailored for hierarchical FGVC. Unlike standard channel dropout which randomly discards channels, or spatial dropout which drops entire feature maps independently, our CWD uses a correlation-based criterion. It specifically targets and suppresses channels that focus on redundant spatial locations (based on peak response coordinates), thereby explicitly encouraging the network to diversify its attention to cover complementary object parts rather than just the most salient ones. This is particularly crucial in FGVC, where subtle differences may be captured by less dominant features. By preventing the model from focusing only on the most salient parts, CWD encourages a more holistic understanding of the object, which can lead to better generalization and robustness against minor variations or occlusions.
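The correlation-based masking described above can be sketched in plain Python as follows. This is an illustrative sketch only, not the implementation used in the experiments. The function names `peak_position` and `cwd_mask` are hypothetical. The text does not specify how the pairwise matrix $M_{c}$ is reduced to a per-channel mask, so the sketch assumes a per-channel redundancy score (the mean distance from a channel's peak to the other peaks) and drops the $\gamma$ fraction of channels with the smallest score. The $3 \times 3$ average pooling and the normalization step are omitted for brevity.

```python
import math

def peak_position(channel):
    """Return (row, col) of the maximum response in an H x W nested list."""
    best = (0, 0)
    for r, row in enumerate(channel):
        for c, value in enumerate(row):
            if value > channel[best[0]][best[1]]:
                best = (r, c)
    return best

def cwd_mask(class_features, gamma):
    """Build a 0/1 mask over the channels of one class-aligned feature group.

    class_features: list of n_k channels, each an H x W nested list.
    gamma: fraction of channels to drop (assumed to apply per channel).
    """
    peaks = [peak_position(ch) for ch in class_features]
    n = len(peaks)
    # Pairwise Euclidean distances between peak positions, as in equation (3).
    dist = [[math.dist(peaks[i], peaks[j]) for j in range(n)] for i in range(n)]
    # Assumed redundancy score: a small mean distance to the other peaks means
    # the channel attends to a location that other channels already cover.
    score = [sum(dist[i]) / max(n - 1, 1) for i in range(n)]
    n_drop = int(gamma * n)
    drop = sorted(range(n), key=lambda i: score[i])[:n_drop]
    return [0 if i in drop else 1 for i in range(n)]

# Three channels peak at the same corner; one peaks elsewhere.
features = [
    [[9, 0], [0, 0]],
    [[8, 0], [0, 0]],
    [[7, 0], [0, 0]],
    [[0, 0], [0, 6]],
]
mask = cwd_mask(features, gamma=0.5)
```

In this toy input, two of the three channels that share a peak location are masked, while the channel responding to a different location is always kept.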

CCMP computes the maximum response across the $n_{k}$ feature channels for a given class. This operation collapses the class-aligned features into a single-channel feature map, effectively retaining the most salient local information, which is critical for fine-grained classification. By selecting the maximal activation at each spatial location, CCMP acts as a feature selection mechanism, highlighting the most responsive regions for a particular category while suppressing weaker, potentially noisy activations. This process not only distills the most discriminative information but also reduces computational complexity for subsequent layers.
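As a small illustration of CCMP, the sketch below takes the element-wise maximum over the channels of one class-aligned feature group and returns a single $H \times W$ map. The function name `ccmp` and the toy values are assumptions for illustration.

```python
def ccmp(class_features):
    """Cross-channel max pooling: element-wise max over channels -> H x W map."""
    height = len(class_features[0])
    width = len(class_features[0][0])
    return [
        [max(ch[r][c] for ch in class_features) for c in range(width)]
        for r in range(height)
    ]

# Two channels (n_k = 2) of size 2 x 2 for a single class.
features = [
    [[0.1, 0.9], [0.2, 0.0]],
    [[0.5, 0.3], [0.1, 0.7]],
]
pooled = ccmp(features)
```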

The summation in equation (1) linearly combines the features from the current fine-grained level with those from all its ancestral coarse-grained levels. This ensures that fine-grained categories not only learn their unique attributes but also inherit the defining properties of their superclasses. This hierarchical fusion strengthens the model’s ability to learn discriminative features that respect the label hierarchy.
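The averaging in equation (1) can be illustrated with a short sketch. It assumes that the single-channel maps have already been produced by CWD and CCMP, and that at each coarser level $r < k$ the map belongs to the ancestor of class $i$. The function name `fuse_levels` and the toy values are hypothetical.

```python
def fuse_levels(pooled_maps):
    """Average single-channel maps from level 1 (coarsest) up to level k."""
    k = len(pooled_maps)
    height = len(pooled_maps[0])
    width = len(pooled_maps[0][0])
    return [
        [sum(m[r][c] for m in pooled_maps) / k for c in range(width)]
        for r in range(height)
    ]

coarse_map = [[1.0, 0.0], [0.0, 1.0]]  # ancestor superclass map, level 1
fine_map = [[0.0, 2.0], [0.0, 1.0]]    # target subclass map, level 2
fused = fuse_levels([coarse_map, fine_map])
```

With a single level ($k = 1$) the function returns the input map unchanged, which is consistent with the single-level special case discussed for equation (5).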

The proposed hierarchically discriminative loss $L_{hd}$ is then defined as the sum of cross-entropy losses over all levels:

$$\begin{aligned} L_{hd} & = \sum_{k = 1}^{3} L_{hd}^{k} \\ L_{hd}^{k} & = L_{CE} \left( y^{k}, \sigma \left( \left[ g (S_{1}^{k}), g (S_{2}^{k}), \dots, g (S_{c_{k}}^{k}) \right] \right) \right) \\ g (S_{i}^{k}) & = \frac{1}{H W} \sum_{n = 1}^{H W} S_{i, n}^{k}, \quad i = 1, 2, \dots, c_{k} \end{aligned} \tag{5}$$

where $g (\cdot)$ denotes the global average pooling operation and $\sigma$ is the softmax function. It is worth noting that the discriminative component from MC-Loss (Chang et al., 2020) is a special case of our proposed $L_{hd}$; our loss degenerates to the MC-Loss discriminative component when the label hierarchy has only a single level.
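The per-level term $L_{hd}^{k}$ in equation (5) can be sketched as follows: each fused map is reduced by global average pooling, the resulting class scores pass through a softmax, and the cross-entropy against the ground-truth class is returned. The helper names and the toy inputs are assumptions. The full $L_{hd}$ would sum this quantity over the hierarchy levels.

```python
import math

def global_average_pool(feature_map):
    """g(.): mean over all H x W positions of a single-channel map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def level_loss(fused_maps, target):
    """Cross-entropy over the class scores g(S_i^k) of one hierarchy level."""
    scores = [global_average_pool(m) for m in fused_maps]
    probs = softmax(scores)
    return -math.log(probs[target])

maps = [
    [[2.0, 2.0], [2.0, 2.0]],  # fused map for class 0
    [[0.0, 0.0], [0.0, 0.0]],  # fused map for class 1
]
loss_if_class_0 = level_loss(maps, target=0)
loss_if_class_1 = level_loss(maps, target=1)
```

The loss is small when the ground-truth class has the strongest pooled response and large otherwise.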

Figure 4 illustrates this process for a two-level hierarchy. The fine-grained features $F^{2}$ are first processed by CWD, which applies a mask $M_{i}^{2}$ to drop correlated feature channels. CCMP then computes the maximum response across the channels to produce a single-channel feature map $S_{i}^{2}$. These maps are linearly combined with coarse-grained features and then passed through a global average pooling layer to compute class scores. Note that the hierarchically discriminative loss is applied only during the training phase.

Figure 4.

Overview of the hierarchically discriminative loss.

3.3. In-Group Regularization Loss

In FGVC, visual differences between categories are often subtle, particularly for categories sharing an in-group relationship. This high intragroup similarity, combined with the often limited size of fine-grained datasets (e.g., CUB-200-2011 (Wah et al., 2011) has only 5,994 training images for 200 classes, whereas ImageNet (Deng et al., 2009) has over a million images for 1,000 classes), increases the risk of overfitting. The network may learn to rely on sample-specific artifacts rather than generalizable, discriminative features to distinguish between visually similar categories, leading to in-group errors. Traditional regularization techniques like weight decay or standard dropout may not be sufficient to address this specific challenge, as they do not explicitly account for the semantic similarity between in-group classes.

To address this, we propose an in-group regularization loss, $L_{ig}$, that regularizes the learning of class-aligned features by introducing controlled noise. Specifically, at the finest label granularity, for a given ground-truth class $y_{t}^{3}$, we first identify the set of all $p$ in-group classes $C_{t} = \{ y_{i}^{3} \mid y_{i}^{3} \text{ and } y_{t}^{3} \text{ have an in-group relationship} \}$, which includes $y_{t}^{3}$ itself. For each class $y_{i}^{3} \in C_{t}$, we retrieve its corresponding class-specific feature $S_{i}^{3}$. We then create a noisy version of the target feature, $S_{t}^{\prime 3}$, by linearly combining it with a distractor feature $S_{d}^{3}$, which is randomly sampled from the set of other in-group class features:

$$\begin{aligned} S_{t}^{\prime 3} & = (1 - \omega) S_{t}^{3} + \omega S_{d}^{3} \\ S_{d}^{3} & = R \left( \{ S_{j}^{3} \mid y_{j}^{3} \in C_{t}, j \neq t \} \right) \end{aligned} \tag{6}$$

where the hyperparameter $\omega$ controls the weight of the noise feature, and $R (\cdot)$ is a function that randomly selects one feature from the input set. A study of different choices for $R (\cdot)$ is presented in Section 4.6.
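The mixing step in equation (6) can be sketched as below. The sketch assumes uniform random selection for $R (\cdot)$ and represents the hierarchy as a child-to-parent dictionary. The names `in_group_classes`, `mix_with_distractor`, and `parent_of`, the toy features, and the value of $\omega$ are illustrative assumptions rather than settings from the experiments.

```python
import random

def in_group_classes(target, parent_of):
    """All fine classes sharing the target's parent, including the target."""
    return [c for c, p in parent_of.items() if p == parent_of[target]]

def mix_with_distractor(features, target, parent_of, omega, rng):
    """Return (1 - omega) * S_t + omega * S_d for a random in-group S_d."""
    # Assumes the target has at least one in-group sibling.
    candidates = [c for c in in_group_classes(target, parent_of) if c != target]
    distractor = rng.choice(candidates)
    mixed = [
        [(1 - omega) * t + omega * d for t, d in zip(t_row, d_row)]
        for t_row, d_row in zip(features[target], features[distractor])
    ]
    return mixed, distractor

# Toy two-level hierarchy using BreakHis class names.
parent_of = {
    "adenosis": "benign",
    "fibroadenoma": "benign",
    "ductal_carcinoma": "malignant",
}
features = {
    "adenosis": [[1.0, 0.0]],
    "fibroadenoma": [[0.0, 1.0]],
    "ductal_carcinoma": [[5.0, 5.0]],
}
mixed, distractor = mix_with_distractor(
    features, "adenosis", parent_of, omega=0.25, rng=random.Random(0)
)
```

Only classes under the same superclass are eligible as distractors, so the out-of-group class never contaminates the target feature. The label of the mixed feature remains the original target class.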

It is important to note that unlike standard data augmentation techniques such as MixUp (Zhang et al., 2018), which interpolate both input images and their corresponding labels (using soft labels), our approach mixes features while retaining the original hard label of the target class. We deliberately chose this design over label smoothing or label mixing for two key reasons. First, our goal with $L_{ig}$ is to regularize the feature space by introducing “distractors” rather than to create a new, intermediate concept. By forcing the model to classify a feature vector contaminated with in-group noise as the original target class, we compel the network to identify and focus on the unique, robust traits of the target class that persist despite the interference. Second, in fine-grained tasks, the semantic margin between in-group classes is already extremely narrow. Using soft labels (e.g., assigning a partial probability to the distractor class) might inadvertently encourage the model to learn shared, ambiguous features, which is counterproductive when the objective is to maximize interclass discriminability within a group. By maintaining the hard label, we impose a stricter constraint that sharpens the decision boundary around the target class.

The in-group regularization loss $L_{ig}$ is then calculated as a cross-entropy loss over only the $p$ in-group categories. A modified feature set $\{ \bar{S}_{i}^{3} \}_{i = 1}^{p}$ is constructed, where the feature for the target class is replaced by its noisy version $S_{t}^{\prime 3}$, while all other in-group features remain unchanged. The model’s predictions are computed from this modified feature set and normalized via a softmax function restricted to this set:

$$\begin{aligned} L_{ig} & = L_{CE} \left( y_{t}^{3}, \operatorname{softmax} \left( \left[ g (\bar{S}_{1}^{3}), g (\bar{S}_{2}^{3}), \dots, g (\bar{S}_{p}^{3}) \right] \right) \right) \\ g (\bar{S}_{i}^{3}) & = \frac{1}{H W} \sum_{n = 1}^{H W} \bar{S}_{i, n}^{3}, \quad i = 1, 2, \dots, p \end{aligned} \tag{7}$$

where $g (\cdot)$ denotes the global average pooling operation and $y_{t}^{3}$ is the ground-truth label for the target class.
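A minimal sketch of equation (7) follows. Because global average pooling is linear, the pooled score of the noisy target equals the same convex combination of the pooled target and distractor scores, so the sketch works directly on pooled scores. The function name `in_group_loss`, the score values, and $\omega = 0.4$ are hypothetical.

```python
import math

def in_group_loss(scores, target_index):
    """Cross-entropy with the softmax restricted to the p in-group classes."""
    shift = max(scores)  # log-sum-exp with max subtraction for stability
    log_norm = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_norm - scores[target_index]

omega = 0.4                      # hypothetical mixing weight
clean_scores = [2.0, 1.0, 0.5]   # g(S_i^3) for p = 3 in-group classes
target_index, distractor_index = 0, 1

# Replace only the target score by its mixed version; others stay unchanged.
noisy_scores = list(clean_scores)
noisy_scores[target_index] = (
    (1 - omega) * clean_scores[target_index]
    + omega * clean_scores[distractor_index]
)

clean_loss = in_group_loss(clean_scores, target_index)
noisy_loss = in_group_loss(noisy_scores, target_index)
```

In this example the mixing lowers the target score, so the loss on the noisy feature set is higher than on the clean one, which reflects the increased learning difficulty described in the text.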

By introducing controlled interference from confusing classes, $L_{ig}$ increases the learning difficulty. To correctly classify the noisy target feature, the network is forced to learn more robust and discriminative representations, rather than overfitting to sample-specific details. This regularization from a complementary perspective helps the model discover more subtle and unique features. As shown in the visualizations in Figure 5, the introduction of $L_{ig}$ enables the model to diffuse its attention and explore additional complementary object parts for recognition.

Figure 5.

Attention map visualizations for HRN, HRN + $L_{hd}$, and HRN + $L_{hd}$ + $L_{ig}$ on the CUB dataset.

3.4. Overall Loss

For multigranularity classification, we use a standard cross-entropy loss at each of the three hierarchical levels, summed as $L_{ce} = \sum_{k = 1}^{3} L_{CE} (y_{k}, {\hat{y}}_{k})$ . The total loss for our model is a weighted combination of this standard loss and our two proposed losses:

$$L_{total} = \lambda_{ce} L_{ce} + \lambda_{hd} L_{hd} + \lambda_{ig} L_{ig} \tag{8}$$

where $\lambda_{ce}$, $\lambda_{hd}$, and $\lambda_{ig}$ are hyperparameters weighting the three components. In our experiments, we set $\lambda_{ce} = \lambda_{hd} = \lambda_{ig} = 0.5$.
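As a brief worked example of equation (8), the sketch below combines per-level cross-entropy terms with the two proposed losses using the weights stated above. The individual loss values are made up for illustration, and the function name `total_loss` is hypothetical.

```python
def total_loss(l_ce, l_hd, l_ig, lam_ce=0.5, lam_hd=0.5, lam_ig=0.5):
    """Weighted sum of the three loss components, as in equation (8)."""
    return lam_ce * l_ce + lam_hd * l_hd + lam_ig * l_ig

# Hypothetical per-level cross-entropy values for the three hierarchy levels.
ce_per_level = [0.2, 0.4, 0.6]
l_ce = sum(ce_per_level)  # L_ce sums the cross-entropy over all levels
value = total_loss(l_ce, l_hd=0.8, l_ig=0.4)
```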

4. Experiments

4.1. Datasets

We evaluated our method on five challenging benchmark datasets, including four standard FGVC datasets and one medical imaging dataset. For datasets without a predefined label hierarchy, we followed the procedure from Chang et al. (2021) and constructed hierarchies by tracing parent nodes on Wikipedia. For all datasets, we used the official training and testing splits.

CUB-200-2011 (CUB) (Wah et al., 2011) is one of the most widely used benchmarks for FGVC. It contains 11,788 images of 200 bird species, with 5,994 images for training and 5,794 for testing. This dataset is particularly challenging due to high intraclass variance caused by differences in bird pose, scale, lighting, and background clutter. Furthermore, it features very low interclass variance, as many species (e.g., different types of sparrows) are visually almost identical to the untrained eye. For our experiments, we use a three-level label hierarchy based on avian taxonomy, which includes 13 orders, 38 families, and 200 species at the finest level.

Butterfly-200 (Chen et al., 2018) is a large-scale butterfly dataset with 25,279 images across 200 species (5,135 for training and 15,009 for testing; the remaining 5,135 images form a validation split). A key feature of this dataset is its deep, four-level label hierarchy, consisting of 5 families, 23 subfamilies, 116 genera, and 200 species. This deep structure makes it an excellent benchmark for evaluating the ability of a model to leverage complex, multilevel semantic relationships. The visual similarity between different genera and species of butterflies presents a significant fine-grained recognition challenge.

FGVC-Aircraft (AIR) (Maji et al., 2013) consists of 10,000 images of 100 different aircraft models, with 6,667 training and 3,333 test images. The fine-grained challenge in this dataset arises from the need to distinguish between aircraft that are often variants of a single model series. For example, a Boeing 737-700 and a Boeing 737-800 may differ only subtly in fuselage length or winglet design. We use a three-level hierarchy that reflects this structure, comprising 30 manufacturers, 70 series, and 100 models.

Stanford Cars (CAR) (Krause et al., 2013) includes 16,185 images of 196 car models, with 8,144 for training and 8,041 for testing. The primary challenge is to classify cars based on their make, model, and year. Distinctions often depend on subtle details like the shape of the headlights or grille, which can vary slightly between model years of the same car. The images also feature significant variation in viewpoint and background clutter. For this dataset, we use a two-level hierarchy consisting of 9 coarse-grained car types (e.g., SUV, Sedan) and 196 fine-grained models.

BreakHis (Spanhol et al., 2015) is a breast cancer histopathology dataset containing 7,909 microscopic images from 82 patients. A unique aspect of this dataset is that images are provided at four different magnification factors ($40\times$, $100\times$, $200\times$, and $400\times$), which adds a layer of complexity as the visual features change with scale. The classification task is inherently hierarchical, with a two-level structure. The top level distinguishes between benign and malignant tumors, a critical clinical determination. The second level further classifies tumors into eight specific subtypes (four benign: adenosis [A], fibroadenoma [F], phyllodes tumor [PT], and tubular adenoma [TA]; and four malignant: ductal carcinoma [DC], lobular carcinoma [LC], mucinous carcinoma [MC], and papillary carcinoma [PC]). Distinguishing between these subtypes is a classic FGVC problem, as they often share morphological features, making accurate classification vital for treatment planning.

4.2. Implementation Details

Image Preprocessing and Augmentation. For a fair comparison, we adopted a standard preprocessing pipeline. All images were first resized to $550 \times 550$ pixels. During the training phase, we performed random cropping to extract $448 \times 448$ patches. This serves as a simple yet effective data augmentation technique, encouraging the model to learn features that are robust to variations in object scale and position. In addition to random cropping, we applied random horizontal flipping with a probability of 0.5. For the testing phase, we used a single center crop of size $448 \times 448$ from the resized $550 \times 550$ image. No other complex augmentation methods were used.
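The crop geometry described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the helper names `random_crop_box`, `center_crop_box`, and `train_augmentation_params` are hypothetical, and the code only computes crop coordinates and the flip decision instead of operating on pixel data.

```python
import random

RESIZED = 550  # images are first resized to 550 x 550 (from the text)
CROP = 448     # crop size used for both training and testing (from the text)


def random_crop_box(rng, size=RESIZED, crop=CROP):
    """Return (left, top, right, bottom) of a uniformly placed training crop."""
    left = rng.randint(0, size - crop)  # randint is inclusive on both ends
    top = rng.randint(0, size - crop)
    return left, top, left + crop, top + crop


def center_crop_box(size=RESIZED, crop=CROP):
    """Return (left, top, right, bottom) of the single test-time center crop."""
    offset = (size - crop) // 2
    return offset, offset, offset + crop, offset + crop


def train_augmentation_params(rng, flip_prob=0.5):
    """Sample the crop box and horizontal-flip decision for one training image."""
    return {"box": random_crop_box(rng), "flip": rng.random() < flip_prob}


rng = random.Random(0)  # fixed seed only to make this sketch reproducible
params = train_augmentation_params(rng)
test_box = center_crop_box()
```

With these sizes, the test-time crop leaves a margin of (550 - 448) / 2 = 51 pixels on every side, while a training crop may start anywhere between 0 and 102 pixels from the top-left corner.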

Network Architecture and Training. We used a ResNet-50 model (He et al., 2016), pretrained on the ImageNet dataset, as our feature extraction backbone. We utilized the backbone without any structural modifications, keeping the original layer configurations intact. The weights of the backbone were fine-tuned during training. The three hierarchy-specific convolutional blocks following the backbone each consist of a $3 \times 3$ convolutional layer, a batch normalization layer, and a ReLU activation function. The number of output channels for these blocks was set to $n_{k} \times c_{k}$ , where $c_{k}$ is the number of classes at level $k$ and $n_{k}$ (the number of channels per class) was set to 16.
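As a worked example of the channel rule $n_{k} \times c_{k}$, the snippet below computes the branch widths for the three-level FGVC-Aircraft hierarchy (30, 70, and 100 classes). It is a small sketch that assumes the same $n_{k} = 16$ at every level; the function name `branch_output_channels` is hypothetical.

```python
CHANNELS_PER_CLASS = 16  # n_k in the text, assumed identical for every level


def branch_output_channels(classes_per_level, channels_per_class=CHANNELS_PER_CLASS):
    """Output channels of each hierarchy-specific block: n_k * c_k per level."""
    return [channels_per_class * c for c in classes_per_level]


# FGVC-Aircraft: 30 manufacturers, 70 series, 100 models (coarse to fine)
air_channels = branch_output_channels([30, 70, 100])
```

Under this assumption, the three blocks would output 480, 1,120, and 1,600 channels, respectively.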

Optimization and Hyperparameters. For optimization, we used Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The use of momentum helps accelerate SGD in the relevant direction and dampens oscillations. A differential learning rate strategy was employed: the learning rate for the pretrained backbone network was set to a smaller value of 0.0002, while the learning rate for all other newly added components (convolutional blocks, FC layers) was set to 0.002. This strategy is common in transfer learning to ensure that the pretrained weights are not drastically altered early in the training process. We employed a cosine annealing strategy to schedule the learning rate, which starts with a relatively high rate and gradually decreases it, allowing for better convergence. The model was trained for a total of 200 epochs with a batch size of 8. The loss hyperparameters were set to $λ_{ce} = λ_{hd} = λ_{ig} = 0.5$ based on the hyperparameter analysis presented in Section 4.6. All experiments were conducted using the PyTorch deep learning framework (version 1.10) on a single Nvidia GeForce RTX 3090 GPU with CUDA 11.3.
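The differential learning rates and the cosine schedule can be illustrated with the closed-form cosine annealing rule. The sketch below rests on assumptions the text does not state: a minimum learning rate of 0, one schedule step per epoch, and no warm-up. The names `cosine_annealed_lr` and `lrs_at` are hypothetical.

```python
import math

EPOCHS = 200  # total training epochs (from the text)
BASE_LRS = {"backbone": 0.0002, "new_layers": 0.002}  # from the text


def cosine_annealed_lr(base_lr, epoch, total_epochs=EPOCHS, min_lr=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, min_lr at total_epochs.

    min_lr=0.0 is an assumption; the text does not give a floor value.
    """
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


def lrs_at(epoch):
    """Learning rate of each parameter group at the given epoch."""
    return {name: cosine_annealed_lr(lr, epoch) for name, lr in BASE_LRS.items()}


schedule = {epoch: lrs_at(epoch) for epoch in (0, 100, 200)}
```

Because both groups follow the same cosine factor, the new layers keep a learning rate ten times that of the backbone throughout training; halfway through (epoch 100), both rates have decayed to half of their initial values.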

Inference Phase. During the inference phase, the auxiliary branches and the proposed loss calculations ( $L_{hd}$ and $L_{ig}$ ) are removed. Only the main classification head at the finest granularity is retained for prediction. Consequently, the computational cost (floating-point operations, FLOPs) and latency during inference are identical to those of the standard backbone network, so our method improves accuracy without introducing any additional overhead at test time.

4.3. Experimental Design and Evaluation Metrics

Following Chen et al. (2022), our experimental setup simulates scenarios with partially available fine-grained labels. We created training sets where 10%, 30%, 50%, 70%, and 100% of the samples have complete, fine-grained labels, while the remaining samples are annotated with only coarse-grained labels. During inference, all models were evaluated on the test set using the complete label hierarchy.
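The partial-label setting can be simulated by hiding the fine-grained label of a random subset of training samples. The sketch below is one plausible way to do this, not the protocol of Chen et al. (2022): it assumes uniform sampling over the whole training set (no per-class stratification), and the `(coarse_label, fine_label)` tuple layout and the function name `mask_fine_labels` are hypothetical.

```python
import random


def mask_fine_labels(samples, keep_ratio, seed=0):
    """Keep the fine-grained label for `keep_ratio` of the samples.

    samples: list of (coarse_label, fine_label) tuples.
    Returns a new list in which dropped fine labels are replaced by None,
    so those samples are supervised by their coarse label only.
    """
    rng = random.Random(seed)
    n_keep = round(keep_ratio * len(samples))
    keep = set(rng.sample(range(len(samples)), n_keep))
    return [
        (coarse, fine if i in keep else None)
        for i, (coarse, fine) in enumerate(samples)
    ]


# Toy training set: 100 samples, 2 coarse classes, 10 fine classes.
toy = [(i % 2, i % 10) for i in range(100)]
partial = {ratio: mask_fine_labels(toy, ratio) for ratio in (0.1, 0.3, 0.5, 0.7, 1.0)}
```

Coarse-grained labels are never removed, which matches the setup in which every training sample retains at least its superclass annotation.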

We use four metrics for performance evaluation. The first is top-1 accuracy, evaluated at each semantic granularity level of the hierarchy (e.g., the coarse-grained, intermediate-grained, and finest-grained levels of a three-level hierarchy). To assess performance across the entire label hierarchy, we use the mean average precision ( $mAP$ ) and the weighted average precision ( $wAP$ ). The $mAP$ is the unweighted average of the top-1 accuracies across all levels, defined as:

$$mAP = \frac{1}{L} \sum_{l = 1}^{L} P_{l} \tag{9}$$

where $L$ is the number of levels in the hierarchy and $P_{l}$ is the top-1 accuracy at level $l$ (e.g., $P_{1}$ is the accuracy at the coarsest level, and $P_{3}$ is the accuracy at the finest level). This metric treats each hierarchical level with equal importance. In contrast, the $wAP$ is defined as:

$$wAP = \sum_{l = 1}^{L} \frac{N_{l}}{\sum_{k = 1}^{L} N_{k}} P_{l} \tag{10}$$

where $N_{l}$ is the number of categories at level $l$. This metric gives greater weight to performance on more granular levels, which have more categories. Finally, to provide a more comprehensive evaluation, particularly for imbalanced classes, we also report the Macro-$F 1$ score at the finest-grained level, which is the harmonic mean of precision and recall calculated for each class and then averaged. Using a combination of these metrics is crucial. While top-1 accuracy at the leaf level is the ultimate goal, it does not penalize a model that correctly identifies a bird as a “Warbler” (family) but fails to identify it as a “Cape May Warbler” (species) any differently than a model that misclassifies it as an “Airplane.” The hierarchical metrics, $mAP$ and $wAP$, capture this nuance. A high score in these metrics indicates that the model is not only accurate at the fine-grained level but also makes semantically reasonable predictions at coarser levels, reflecting a deeper and more robust understanding of the class relationships.
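Equations (9) and (10) and the Macro-$F 1$ score can be computed as in the following sketch. The function names `mean_ap`, `weighted_ap`, and `macro_f1` are hypothetical, and `macro_f1` assumes that a class with no true or predicted samples contributes an $F 1$ of 0. The worked numbers are the per-level accuracies of our method on FGVC-Aircraft with 10% fine-grained labels (Table 2), combined with the 30/70/100 hierarchy sizes of that dataset.

```python
def mean_ap(level_acc):
    """Eq. (9): unweighted mean of the per-level top-1 accuracies."""
    return sum(level_acc) / len(level_acc)


def weighted_ap(level_acc, classes_per_level):
    """Eq. (10): per-level accuracies weighted by the number of categories."""
    total = sum(classes_per_level)
    return sum(n * p for n, p in zip(classes_per_level, level_acc)) / total


def macro_f1(y_true, y_pred):
    """Per-class F1 (harmonic mean of precision and recall), averaged over classes."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumption: 0 if undefined
    return sum(scores) / len(scores)


# FGVC-Aircraft, 10% fine-grained labels, "Ours" row of Table 2.
air_acc = [0.973, 0.952, 0.792]  # P_1, P_2, P_3 (coarse to fine)
air_classes = [30, 70, 100]      # manufacturers, series, models

air_map = mean_ap(air_acc)                   # about 0.906
air_wap = weighted_ap(air_acc, air_classes)  # about 0.875
```

Both values agree with the $mAP$ and $wAP$ reported for this row in Table 2, which illustrates how the $wAP$ is pulled toward the finest-level accuracy because that level has the most categories.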

4.4. Results on the Pathology Dataset

For a representative medical application, we evaluated our method on the BreakHis dataset (Spanhol et al., 2015) for breast cancer classification, using a 100% complete label hierarchy.

The baseline methods were divided into three groups: (1) traditional CNNs, such as ResNet-50 (He et al., 2016) and Inception V3 (Szegedy et al., 2016); (2) fine-grained methods designed for natural images, such as NTS-Net (Yang et al., 2018) and WS-DAN (Hu et al., 2019); and (3) fine-grained methods specifically developed for medical images, such as DLD-Net (Tian et al., 2023), EFML (Li et al., 2023), and MagNet (Iqbal et al., 2024). We also include comparisons with the recent SOTA methods SFME (Yu et al., 2025) and MCG (Yang et al., 2024).

As shown in Table 1, our method achieves the highest accuracy across all image magnifications. We attribute this consistent improvement to our two proposed loss functions, which leverage the label hierarchy to guide the model toward learning more discriminative features while mitigating overfitting to sample-specific artifacts. The task of classifying breast cancer subtypes is a natural fit for our approach. For instance, distinguishing between benign subtypes like adenosis and fibroadenoma represents a classic in-group challenge due to their high visual similarity. Our in-group regularization loss ( $L_{ig}$ ) is specifically designed to handle such cases by forcing the model to discover subtle, yet consistent, differentiating features. Similarly, the clear hierarchical distinction between benign and malignant tumors is effectively exploited by our hierarchically discriminative loss ( $L_{hd}$ ), which reduces the risk of severe out-of-group errors. It is noteworthy that recent methods also achieve strong results; however, our targeted hierarchical loss functions provide a clear performance advantage by explicitly modeling the semantic relationships between the class labels.

Table 1.
Results on the BreakHis Dataset.

	Accuracy/ $F 1$ -Score
Method	$40 \times$	$100 \times$	$200 \times$	$400 \times$
ResNet-50 (He et al., 2016)	0.940/0.934	0.958/0.951	0.927/0.920	0.938/0.932
Inception V3 (Szegedy et al., 2016)	0.902/0.896	0.856/0.848	0.861/0.854	0.825/0.819
22-Layer CNN (Hou, 2020)	0.909/0.902	0.900/0.893	0.910/0.905	0.910/0.904
CSD-CNN (Han et al., 2017)	0.928/0.921	0.939/0.934	0.937/0.930	0.929/0.924
MT-FT-Xception (Li et al., 2020)	0.948/0.943	0.940/0.933	0.939/0.931	0.907/0.900
MuDeRN (Gandomkar et al., 2018)	0.956/0.949	0.949/0.944	0.957/0.950	0.946/0.941
NTS-Net (Yang et al., 2018)	0.967/0.960	0.952/0.947	0.934/0.929	0.912/0.905
WS-DAN (Hu et al., 2019)	0.950/0.945	0.962/0.955	0.920/0.913	0.901/0.896
BCNNs (Liu et al., 2020)	0.960/0.953	0.958/0.951	0.947/0.942	0.945/0.940
DLD-Net (Tian et al., 2023)	0.973/0.966	0.974/0.969	0.940/0.933	0.963/0.958
SVM(RBF)-ASO (Atban et al., 2023)	0.961/0.954	0.943/0.936	0.973/0.966	0.963/0.956
EFML (Li et al., 2023)	0.968/0.963	0.974/0.969	0.967/0.960	0.964/0.959
MagNet (Iqbal et al., 2024)	0.974/0.969	0.941/0.934	0.975/0.968	0.981/0.976
MCG (Yang et al., 2024)	0.975/0.970	0.978/0.973	0.976/0.971	0.980/0.975
SFME (Yu et al., 2025)	0.978/0.973	0.975/0.970	0.979/0.974	0.982/0.977
Ours	0.984/0.980	0.984/0.979	0.983/0.978	0.984/0.981

The best results are shown in bold.

4.5. Results on Fine-Grained Datasets

We conducted a comprehensive evaluation on four standard FGVC benchmarks—CUB-200-2011, Butterfly-200, FGVC-Aircraft, and Stanford Cars—following the hierarchical setup from Chen et al. (2022). This involved training all methods with varying ratios of fine-grained labels.

For comparison, we selected SOTA techniques in both hierarchical multilabel classification (HMC-LMLP, Cerri et al., 2016; HMCN, Wehrmann et al., 2018; C-HMCNN, Giunchiglia & Lukasiewicz, 2020), hierarchical fine-grained classification (Chang et al., 2021; HRN, Chen et al., 2022), and recent general FGVC methods (MCG, Yang et al., 2024; SFME, Yu et al., 2025).

As shown in Tables 2 through 6, our method achieves the best or near-best top-1 accuracy at each hierarchical level, and the best overall $wAP$ and $mAP$ at every annotation ratio below 100%; with 100% fine-grained labels, it is on par with the strongest baselines. Our method also achieves the highest $F 1$-scores, indicating a better balance between precision and recall, especially for harder-to-classify fine-grained categories. The performance gains are particularly pronounced in low-annotation scenarios, where the model must generalize from limited data. For instance, with only 10% of training samples having fine-grained labels (Table 2), our method improves the $wAP$ over the strongest recent baseline (SFME) by 2.7 percentage points on CUB, 4.5 on AIR, and 4.9 on CAR. This demonstrates the strong regularization effect of our proposed losses, which guide the feature learning process even when detailed supervision is scarce. With 30% fine-grained labels (Table 3), the $wAP$ improvements remain substantial at 2.5, 1.9, and 0.8 percentage points on CUB, AIR, and CAR, respectively.

Table 2.
Comparison of Accuracy Results with SOTA Methods on Three Datasets.

	CUB-200-2011 (CUB)						FGVC-Aircraft (AIR)						Stanford Cars (CAR)
Method	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$F 1$	$wAP$	$mAP$
HMC-LMLP (Cerri et al., 2016)	0.984	0.944	0.229	0.212	0.376	0.719	0.970	0.934	0.744	0.722	0.844	0.883	0.964	0.135	0.118	0.171	0.550
HMCN (Wehrmann et al., 2018)	0.973	0.869	0.307	0.289	0.427	0.716	0.934	0.895	0.701	0.681	0.804	0.843	0.934	0.199	0.172	0.231	0.567
C-HMCNN (Giunchiglia & Lukasiewicz, 2020)	0.983	0.944	0.262	0.246	0.403	0.730	0.968	0.944	0.710	0.692	0.831	0.874	0.965	0.135	0.120	0.171	0.550
Chang et al. (Chang et al., 2021)	0.971	0.919	0.494	0.475	0.583	0.795	0.499	0.937	0.650	0.628	0.728	0.695	0.923	0.458	0.440	0.478	0.691
HRN (Chen et al., 2022)	0.980	0.933	0.530	0.512	0.614	0.814	0.954	0.917	0.711	0.689	0.820	0.861	0.943	0.493	0.474	0.513	0.718
MCG (Yang et al., 2024)	0.981	0.935	0.535	0.518	0.618	0.817	0.958	0.921	0.716	0.697	0.825	0.865	0.948	0.501	0.482	0.520	0.724
SFME (Yu et al., 2025)	0.982	0.938	0.541	0.523	0.625	0.820	0.960	0.925	0.723	0.705	0.830	0.869	0.951	0.512	0.494	0.530	0.731
Ours	0.989	0.953	0.573	0.556	0.652	0.838	0.973	0.952	0.792	0.771	0.875	0.906	0.966	0.561	0.541	0.579	0.764

The best results are shown in bold. During the training phase, only 10% of the samples have fine-grained labels. FGVC = fine-grained visual classification; wAP = weighted average precision; mAP = mean average precision.

Table 3.

Comparison of Accuracy Results with SOTA Methods on Three Datasets.

	CUB-200-2011 (CUB)						FGVC-Aircraft (AIR)						Stanford Cars (CAR)
Method	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$F 1$	$wAP$	$mAP$
HMC-LMLP (Cerri et al., 2016)	0.983	0.938	0.480	0.463	0.575	0.800	0.970	0.937	0.816	0.794	0.881	0.908	0.969	0.415	0.401	0.439	0.692
HMCN (Wehrmann et al., 2018)	0.972	0.913	0.529	0.507	0.610	0.805	0.958	0.905	0.784	0.761	0.852	0.882	0.930	0.527	0.508	0.545	0.729
C-HMCNN (Giunchiglia & Lukasiewicz, 2020)	0.980	0.939	0.501	0.486	0.592	0.807	0.967	0.940	0.801	0.776	0.875	0.903	0.957	0.432	0.417	0.455	0.695
Chang et al. (Chang et al., 2021)	0.967	0.917	0.700	0.679	0.747	0.861	0.588	0.938	0.830	0.806	0.832	0.785	0.929	0.761	0.742	0.768	0.845
HRN (Chen et al., 2022)	0.984	0.939	0.740	0.721	0.783	0.888	0.968	0.942	0.845	0.823	0.897	0.918	0.961	0.837	0.818	0.842	0.899
MCG (Yang et al., 2024)	0.985	0.942	0.744	0.726	0.785	0.890	0.969	0.945	0.850	0.827	0.900	0.921	0.962	0.840	0.821	0.845	0.901
SFME (Yu et al., 2025)	0.986	0.945	0.751	0.733	0.790	0.894	0.971	0.948	0.856	0.833	0.905	0.925	0.963	0.843	0.824	0.848	0.903
Ours	0.989	0.961	0.776	0.759	0.815	0.909	0.975	0.959	0.885	0.866	0.924	0.940	0.965	0.851	0.832	0.856	0.908

The best results are shown in bold. During the training phase, only 30% of the samples have fine-grained labels. FGVC = fine-grained visual classification; wAP = weighted average precision; mAP = mean average precision.

Table 4.

Comparison of Accuracy Results with SOTA Methods on Three Datasets.

	CUB-200-2011 (CUB)						FGVC-Aircraft (AIR)						Stanford Cars (CAR)
Method	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$F 1$	$wAP$	$mAP$
HMC-LMLP (Cerri et al., 2016)	0.984	0.938	0.643	0.612	0.705	0.855	0.972	0.938	0.836	0.806	0.892	0.915	0.969	0.665	0.641	0.678	0.817
HMCN (Wehrmann et al., 2018)	0.967	0.909	0.643	0.613	0.700	0.840	0.957	0.921	0.815	0.788	0.873	0.898	0.935	0.730	0.701	0.739	0.833
C-HMCNN (Giunchiglia & Lukasiewicz, 2020)	0.983	0.941	0.675	0.646	0.731	0.866	0.965	0.939	0.852	0.821	0.899	0.919	0.960	0.702	0.678	0.713	0.831
Chang et al. (Chang et al., 2021)	0.974	0.935	0.793	0.771	0.824	0.901	0.736	0.944	0.867	0.838	0.874	0.849	0.956	0.881	0.855	0.884	0.919
HRN (Chen et al., 2022)	0.979	0.943	0.805	0.781	0.835	0.909	0.973	0.957	0.897	0.871	0.929	0.942	0.959	0.887	0.862	0.890	0.923
MCG (Yang et al., 2024)	0.987	0.949	0.811	0.787	0.841	0.916	0.974	0.958	0.899	0.873	0.930	0.944	0.963	0.896	0.870	0.899	0.929
SFME (Yu et al., 2025)	0.988	0.951	0.815	0.791	0.845	0.918	0.975	0.959	0.901	0.875	0.932	0.945	0.965	0.901	0.875	0.904	0.933
Ours	0.990	0.960	0.826	0.804	0.855	0.925	0.977	0.961	0.907	0.882	0.936	0.948	0.971	0.910	0.884	0.913	0.941

The best results are shown in bold. During the training phase, only 50% of the samples have fine-grained labels. FGVC = fine-grained visual classification; wAP = weighted average precision; mAP = mean average precision.

Table 5.

Comparison of Accuracy Results with SOTA Methods on Three Datasets.

	CUB-200-2011 (CUB)						FGVC-Aircraft (AIR)						Stanford Cars (CAR)
Method	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$F 1$	$wAP$	$mAP$
HMC-LMLP (Cerri et al., 2016)	0.982	0.936	0.713	0.685	0.761	0.877	0.969	0.933	0.854	0.826	0.899	0.919	0.969	0.792	0.768	0.800	0.881
HMCN (Wehrmann et al., 2018)	0.968	0.920	0.712	0.686	0.757	0.867	0.961	0.927	0.854	0.829	0.896	0.914	0.944	0.816	0.789	0.822	0.880
C-HMCNN (Giunchiglia & Lukasiewicz, 2020)	0.980	0.939	0.749	0.720	0.790	0.889	0.968	0.943	0.884	0.858	0.917	0.932	0.962	0.819	0.792	0.825	0.891
Chang et al. (Chang et al., 2021)	0.978	0.941	0.825	0.800	0.858	0.915	0.874	0.944	0.893	0.866	0.908	0.904	0.962	0.916	0.890	0.918	0.939
HRN (Chen et al., 2022)	0.983	0.948	0.839	0.816	0.863	0.923	0.973	0.955	0.916	0.891	0.938	0.948	0.961	0.906	0.881	0.908	0.934
MCG (Yang et al., 2024)	0.988	0.952	0.842	0.818	0.867	0.927	0.973	0.957	0.917	0.891	0.939	0.949	0.966	0.918	0.892	0.920	0.942
SFME (Yu et al., 2025)	0.989	0.955	0.845	0.821	0.870	0.930	0.974	0.958	0.918	0.893	0.940	0.950	0.968	0.925	0.901	0.927	0.947
Ours	0.990	0.962	0.852	0.829	0.876	0.935	0.974	0.959	0.920	0.895	0.942	0.951	0.973	0.934	0.910	0.936	0.954

The best results are shown in bold. During the training phase, 70% of the samples have fine-grained labels. FGVC = fine-grained visual classification; wAP = weighted average precision; mAP = mean average precision.

Table 6.

Comparison of Accuracy Results with SOTA Methods on Three Datasets.

	CUB-200-2011 (CUB)						FGVC-Aircraft (AIR)						Stanford Cars (CAR)
Method	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$F 1$	$wAP$	$mAP$
HMC-LMLP (Cerri et al., 2016)	0.985	0.942	0.796	0.770	0.828	0.908	0.971	0.944	0.903	0.878	0.928	0.939	0.970	0.877	0.852	0.881	0.924
HMCN (Wehrmann et al., 2018)	0.973	0.932	0.798	0.771	0.827	0.901	0.961	0.926	0.872	0.849	0.904	0.920	0.952	0.887	0.860	0.890	0.920
C-HMCNN (Giunchiglia & Lukasiewicz, 2020)	0.985	0.946	0.816	0.792	0.844	0.916	0.975	0.954	0.917	0.891	0.939	0.949	0.968	0.906	0.880	0.909	0.937
Chang et al. (Chang et al., 2021)	0.978	0.942	0.856	0.832	0.875	0.925	0.969	0.953	0.919	0.893	0.938	0.947	0.964	0.937	0.912	0.938	0.951
HRN (Chen et al., 2022)	0.987	0.955	0.866	0.840	0.886	0.936	0.975	0.958	0.926	0.900	0.945	0.953	0.974	0.936	0.913	0.938	0.955
MCG (Yang et al., 2024)	0.991	0.964	0.876	0.850	0.895	0.943	0.979	0.962	0.938	0.912	0.952	0.960	0.977	0.948	0.922	0.949	0.963
SFME (Yu et al., 2025)	0.992	0.964	0.877	0.852	0.896	0.944	0.979	0.961	0.937	0.910	0.951	0.958	0.977	0.947	0.921	0.948	0.962
Ours	0.992	0.964	0.878	0.854	0.897	0.945	0.979	0.961	0.938	0.913	0.952	0.959	0.977	0.948	0.923	0.949	0.963

The best results are shown in bold. During the training phase, 100% of the samples have fine-grained labels. FGVC = fine-grained visual classification; wAP = weighted average precision; mAP = mean average precision.

Table 7.

Accuracy Results on Three Datasets Compared with SOTA Methods.

		CUB-200-2011 (CUB)						FGVC-Aircraft (AIR)						Stanford Cars (CAR)
Ratio	Method	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$	$mAP$	$P_{1}$	$P_{2}$	$F 1$	$wAP$	$mAP$
5%	HRN (Chen et al., 2022)	0.977	0.940	0.488	0.462	0.580	0.802	0.967	0.923	0.675	0.651	0.806	0.855	0.956	0.306	0.288	0.344	0.631
	MCG (Yang et al., 2024)	0.979	0.941	0.492	0.470	0.585	0.804	0.969	0.930	0.689	0.662	0.815	0.863	0.958	0.321	0.302	0.355	0.641
	SFME (Yu et al., 2025)	0.980	0.942	0.499	0.474	0.592	0.808	0.970	0.935	0.701	0.675	0.825	0.869	0.960	0.345	0.322	0.372	0.652
	Ours	0.987	0.945	0.518	0.496	0.606	0.817	0.972	0.947	0.742	0.718	0.848	0.887	0.967	0.380	0.358	0.406	0.674
1%	HRN (Chen et al., 2022)	0.983	0.938	0.313	0.284	0.440	0.745	0.962	0.922	0.507	0.478	0.720	0.797	0.932	0.094	0.075	0.131	0.513
	MCG (Yang et al., 2024)	0.984	0.940	0.325	0.298	0.452	0.750	0.966	0.931	0.541	0.516	0.745	0.813	0.941	0.118	0.096	0.152	0.530
	SFME (Yu et al., 2025)	0.985	0.941	0.334	0.309	0.461	0.753	0.968	0.938	0.565	0.540	0.762	0.824	0.950	0.142	0.121	0.175	0.547
	Ours	0.986	0.944	0.355	0.332	0.475	0.762	0.973	0.950	0.623	0.592	0.790	0.849	0.967	0.185	0.160	0.219	0.576

The best results are shown in bold. During the training phase, 5% or 1% of the samples have fine-grained labels. FGVC = fine-grained visual classification; wAP = weighted average precision; mAP = mean average precision.

As the proportion of fine-grained labels increases (Tables 4 to 6), all methods show improved performance and the margin narrows, yet our approach maintains an advantage at the 50% and 70% ratios. When all training samples have complete fine-grained labels (Table 6), our method is on par with the latest SOTA methods: its $wAP$ is 0.1 percentage points above SFME on CUB and equal to that of MCG on AIR and CAR. This indicates that the proposed losses help most when fine-grained supervision is scarce and do not degrade performance in data-rich settings. By explicitly enhancing interclass discrimination ( $L_{hd}$ ) and regularizing intragroup feature learning ( $L_{ig}$ ), our method discovers more robust and discriminative features, leading to competitive or superior recognition across annotation ratios.

To further stress-test our method, we conducted experiments with extremely few fine-grained labels (5% and 1%), with the results shown in Table 7. In these challenging settings, the advantage of our approach becomes even clearer. With only 1% of fine-grained labels, our method improves the $wAP$ over the strongest baseline by a substantial 1.4% on CUB, 2.8% on AIR, and a remarkable 4.4% on CAR. The $F 1$ -score improvements are equally significant, further confirming the robustness of our method. These results strongly confirm that our loss functions are not merely leveraging existing labels but are actively promoting generalization and combating the severe overfitting that is typical in scenarios with extremely limited labeled data.

Finally, to demonstrate scalability to deeper hierarchies, we experimented on the four-level Butterfly-200 dataset (Table 8). Our method again significantly outperforms the recent baselines across all seven annotation ratios. The improvements are most dramatic in the low-data regime, with $wAP$ gains of 3.6% (at 10% labels), 6.6% (at 5% labels), and 4.1% (at 1% labels) over the best-performing SOTA method. Consistently higher $F 1$ -scores across all ratios also highlight the method’s effectiveness in maintaining class-wise performance. This confirms that our approach effectively leverages label hierarchies, regardless of their depth, to guide feature learning, especially when fine-grained annotations are scarce.

Table 8.

Accuracy Results on the Butterfly-200 Dataset Compared with SOTA Methods.

		Butterfly-200
Ratio	Method	$P_{1}$	$P_{2}$	$P_{3}$	$P_{4}$	$F 1$	$wAP$	$mAP$
100%	HRN (Chen et al., 2022)	0.990	0.977	0.950	0.855	0.835	0.897	0.943
	MCG (Yang et al., 2024)	0.990	0.978	0.951	0.859	0.839	0.899	0.944
	SFME (Yu et al., 2025)	0.991	0.978	0.952	0.861	0.842	0.901	0.945
	Ours	0.992	0.980	0.955	0.867	0.848	0.906	0.949
70%	HRN (Chen et al., 2022)	0.990	0.977	0.949	0.845	0.824	0.891	0.940
	MCG (Yang et al., 2024)	0.990	0.977	0.950	0.847	0.826	0.892	0.941
	SFME (Yu et al., 2025)	0.991	0.978	0.951	0.849	0.829	0.894	0.942
	Ours	0.992	0.979	0.953	0.853	0.833	0.897	0.944
50%	HRN (Chen et al., 2022)	0.990	0.976	0.944	0.818	0.796	0.874	0.932
	MCG (Yang et al., 2024)	0.990	0.976	0.946	0.822	0.800	0.877	0.934
	SFME (Yu et al., 2025)	0.991	0.977	0.948	0.827	0.805	0.881	0.936
	Ours	0.992	0.978	0.951	0.836	0.814	0.887	0.939
30%	HRN (Chen et al., 2022)	0.988	0.971	0.940	0.770	0.748	0.844	0.917
	MCG (Yang et al., 2024)	0.989	0.973	0.942	0.781	0.759	0.851	0.921
	SFME (Yu et al., 2025)	0.990	0.975	0.945	0.789	0.765	0.858	0.925
	Ours	0.992	0.978	0.951	0.802	0.781	0.867	0.931
10%	HRN (Chen et al., 2022)	0.966	0.962	0.861	0.555	0.528	0.689	0.836
	MCG (Yang et al., 2024)	0.975	0.965	0.885	0.589	0.562	0.721	0.852
	SFME (Yu et al., 2025)	0.980	0.968	0.902	0.611	0.589	0.743	0.865
	Ours	0.988	0.972	0.933	0.663	0.638	0.779	0.889
5%	HRN (Chen et al., 2022)	0.951	0.902	0.819	0.418	0.389	0.593	0.773
	MCG (Yang et al., 2024)	0.968	0.925	0.854	0.498	0.468	0.655	0.811
	SFME (Yu et al., 2025)	0.975	0.943	0.881	0.535	0.509	0.692	0.833
	Ours	0.988	0.973	0.936	0.625	0.601	0.758	0.881
1%	HRN (Chen et al., 2022)	0.976	0.947	0.893	0.391	0.365	0.607	0.802
	MCG (Yang et al., 2024)	0.979	0.955	0.905	0.451	0.428	0.643	0.820
	SFME (Yu et al., 2025)	0.981	0.960	0.915	0.485	0.462	0.671	0.835
	Ours	0.988	0.973	0.939	0.544	0.521	0.712	0.861

The best results are shown in bold. During the training phase, different ratios of samples have fine-grained labels. wAP = weighted average precision; mAP = mean average precision.

4.6. Ablation Study

4.6.1. Contribution of $L_{hd}$ and $L_{ig}$

To isolate the contribution of each proposed loss function, we conducted an ablation study using the strong HRN (Chen et al., 2022) model as our baseline. We sequentially added $L_{hd}$ and $L_{ig}$ and evaluated the performance on the CUB, AIR, and CAR datasets with 100% fine-grained labels. The results, reported at the finest granularity in Table 9, show that adding either $L_{hd}$ or $L_{ig}$ individually provides a significant performance boost over the baseline. Combining both losses yields the best results, demonstrating their complementary benefits. The $L_{hd}$ component improves performance by enforcing hierarchical consistency in the feature space, while the $L_{ig}$ component encourages the model to discover more robust features for distinguishing highly similar in-group categories.

Table 9.
The Effect of the Two Proposed Loss Functions on Three Datasets.

			Accuracy/ $F 1$ -Score
Baseline	$L_{hd}$	$L_{ig}$	CUB	AIR	CAR
✓			0.865/0.841	0.926/0.898	0.936/0.915
✓	✓		0.869/0.845	0.937/0.910	0.943/0.923
✓		✓	0.869/0.844	0.929/0.902	0.942/0.922
✓	✓	✓	0.878/0.854	0.938/0.913	0.948/0.929

The best results are shown in bold. In this experiment, we use HRN (Chen et al., 2022) as the baseline model.

To further validate that each loss function addresses its intended problem, we performed a targeted error analysis. We define an “out-of-group error” as a misclassification where the predicted category and the ground-truth category belong to different superclasses, and an “in-group error” as one where they share the same superclass. By design, $L_{hd}$ targets out-of-group errors, while $L_{ig}$ targets in-group errors. The results in Figure 6 are consistent with this design motivation. The introduction of $L_{hd}$ leads to a marked reduction in out-of-group errors (10.3% on CUB, 11.5% on AIR, and 12.9% on CAR). Subsequently, adding $L_{ig}$ primarily reduces the remaining in-group errors (6.6% on CUB, 3.3% on AIR, and 11.9% on CAR). We note that on the AIR dataset, adding $L_{ig}$ slightly increases out-of-group errors; we speculate this is because many aircraft models lack in-group siblings, limiting the applicability of in-group regularization.
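The two error types defined above can be counted with a simple lookup from each fine-grained class to its superclass. The following sketch uses the BreakHis subtypes and their benign/malignant grouping as toy data; the function name `split_errors` and the example predictions are hypothetical and are not taken from our experiments.

```python
def split_errors(y_true, y_pred, superclass_of):
    """Count misclassifications by whether prediction and target share a superclass."""
    counts = {"in_group": 0, "out_of_group": 0}
    for target, pred in zip(y_true, y_pred):
        if target == pred:
            continue  # correct prediction, not an error
        if superclass_of[target] == superclass_of[pred]:
            counts["in_group"] += 1
        else:
            counts["out_of_group"] += 1
    return counts


# BreakHis two-level hierarchy: four benign and four malignant subtypes.
SUPERCLASS = {
    "A": "benign", "F": "benign", "PT": "benign", "TA": "benign",
    "DC": "malignant", "LC": "malignant", "MC": "malignant", "PC": "malignant",
}

# Illustrative labels only (made up for this sketch).
y_true = ["A", "F", "DC", "LC", "PT"]
y_pred = ["F", "F", "DC", "A", "PT"]
errors = split_errors(y_true, y_pred, SUPERCLASS)
```

Here, predicting fibroadenoma for an adenosis sample counts as an in-group error, whereas predicting adenosis for a lobular carcinoma sample crosses the benign/malignant boundary and counts as an out-of-group error.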

Figure 6.

The number of out-of-group and in-group errors for the baseline model on three datasets, with and without our proposed loss functions.

4.6.2. Analysis of In-Group Regularization Strategy

We investigated several strategies for selecting the distractor feature in $L_{ig}$ , governed by the function $R (\cdot)$ . Table 10 compares the default random sampling strategy against two alternatives: using the average of all in-group features, and using the feature from the most similar (highest predicted probability) in-group class. The results show that random sampling is the most effective strategy. By exposing the model to a diverse range of confusing distractors during training, random sampling encourages the learning of more robust and generalizable features that are better suited for distinguishing between challenging in-group categories.
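The three candidate choices of $R (\cdot)$ can be sketched as follows. This is an illustration under stated assumptions, not our implementation: features are plain lists, the "average" strategy is assumed to average over sibling classes only (excluding the target), and the additive mixing in `mix` is an assumed form that may differ from the exact definition of $L_{ig}$. The names `pick_distractor` and `mix` are hypothetical.

```python
import random


def pick_distractor(strategy, target, in_group, class_feats, class_probs, rng):
    """Return the distractor feature for one sample, or None if no sibling exists.

    in_group:    class ids sharing the target's superclass (target included)
    class_feats: dict mapping class id -> feature vector (list of floats)
    class_probs: dict mapping class id -> predicted probability for this sample
    """
    siblings = [c for c in in_group if c != target]
    if not siblings:
        return None  # e.g., a class that is alone in its superclass
    if strategy == "random":
        return class_feats[rng.choice(siblings)]
    if strategy == "most_similar":
        return class_feats[max(siblings, key=lambda c: class_probs[c])]
    if strategy == "average":
        dim = len(class_feats[target])
        return [
            sum(class_feats[c][d] for c in siblings) / len(siblings)
            for d in range(dim)
        ]
    raise ValueError(f"unknown strategy: {strategy}")


def mix(target_feat, distractor, omega=0.2):
    """ASSUMED mixing rule: add the distractor scaled by the noise weight omega."""
    return [t + omega * d for t, d in zip(target_feat, distractor)]


feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 3.0]}
probs = {0: 0.6, 1: 0.1, 2: 0.3}
rng = random.Random(0)

similar = pick_distractor("most_similar", 0, [0, 1, 2], feats, probs, rng)
average = pick_distractor("average", 0, [0, 1, 2], feats, probs, rng)
sampled = pick_distractor("random", 0, [0, 1, 2], feats, probs, rng)
mixed = mix(feats[0], similar)
```

Only the "random" strategy changes its distractor from one iteration to the next for the same sample, which is the source of the diversity discussed above; the other two strategies are deterministic given the current features and predictions.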

Table 10.
Comparison of Different Regularization Methods.

	Accuracy/ $F 1$ -Score
Method	CUB	CAR
Baseline	0.887/0.860	0.943/0.923
$+ L_{ig}$ (randomly choosing noise category)	0.891/0.866	0.950/0.931
$+ L_{ig}$ (choosing the most similar category)	0.888/0.862	0.943/0.924
$+ L_{ig}$ (taking the average of in-group categories)	0.889/0.863	0.944/0.925

The best results are shown in bold. HRN (Chen et al., 2022) combined with $L_{hd}$ is used as the baseline model.

4.6.3. Hyperparameter Sensitivity Analysis

We analyzed the model’s sensitivity to its key hyperparameters: the noise weight $ω$ in $L_{ig}$ , the dropout ratio $γ$ in CWD, and the overall loss weights $λ_{ce}$ , $λ_{hd}$ , and $λ_{ig}$ .

Analysis of $ω$ : The hyperparameter $ω$ controls the intensity of the noise feature mixed into the target feature in $L_{ig}$ . As shown in Figure 7, we tested values of $ω$ ranging from 0.1 to 0.9. The best fine-grained accuracy was achieved on the CUB and AIR datasets at $ω = 0.2$ . On the CAR dataset, performance was similar for $ω = 0.2$ and $ω = 0.5$ . We therefore selected $ω = 0.2$ for all experiments.

We also acknowledge the sensitivity of the hyperparameter $ω$ , as indicated in our experiments. While extensive grid search is ideal, a practical heuristic for tuning $ω$ on new datasets is to start with a low value (e.g., 0.1) and incrementally increase it until training instability or performance degradation is observed. Since $ω$ represents the level of “noise” injected, datasets with higher interclass similarity (harder fine-grained tasks) may tolerate or even benefit from slightly higher $ω$ values to force stronger discriminative learning, whereas easier datasets might require lower values to avoid overwhelming the target features.
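The incremental heuristic for $ω$ could be automated along the following lines. This is a rough sketch: `evaluate` stands for a user-supplied, hypothetical callback that trains with a given $ω$ and returns validation accuracy, the 0.1 step size is an assumption, and the stopping rule only models performance degradation, not training instability.

```python
def tune_omega(evaluate, start=0.1, step=0.1, max_omega=0.9, tolerance=0.0):
    """Increase omega from `start` until validation accuracy degrades.

    Returns (best_omega, best_accuracy) among the values tried.
    """
    best_omega, best_acc = None, float("-inf")
    n_steps = int(round((max_omega - start) / step)) + 1
    for i in range(n_steps):
        omega = round(start + i * step, 10)  # avoid float drift such as 0.30000000000000004
        acc = evaluate(omega)
        if acc < best_acc - tolerance:
            break  # degradation observed: stop increasing omega
        if acc > best_acc:
            best_omega, best_acc = omega, acc
    return best_omega, best_acc


def toy_evaluate(omega):
    """Stand-in for a real training run; peaks at omega = 0.2 by construction."""
    return 1.0 - (omega - 0.2) ** 2


best_omega, best_acc = tune_omega(toy_evaluate)
```

In practice, a nonzero `tolerance` (or averaging over several seeds) would make the stopping rule less sensitive to run-to-run noise in validation accuracy.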

Analysis of $γ$ : The hyperparameter $γ$ determines the proportion of correlated feature channels to be masked by CWD. We evaluated its impact on the CUB dataset, with results shown in Table 11. While a higher value of $γ = 0.7$ yielded the best accuracy at the intermediate level ( $P_{2}$ ), our primary goal is fine-grained accuracy. The optimal performance at the finest level ( $P_{3}$ ), as well as the best overall $wAP$ , was achieved with $γ = 0.3$ , which we used as the default value.

Figure 7.

Sensitivity test for $ω$ on three datasets.

Table 11.

Analysis of $γ$ on the CUB-200-2011 Dataset.

	CUB-200-2011 (CUB)
$γ$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$
0.1	0.991	0.964	0.873	0.849	0.893
0.3	0.992	0.964	0.878	0.854	0.897
0.5	0.989	0.964	0.875	0.851	0.894
0.7	0.990	0.965	0.874	0.850	0.894
0.9	0.988	0.961	0.869	0.845	0.890

The best results are shown in bold. wAP = weighted average precision.
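As a rough illustration of how $γ$ sets the share of masked channels, the sketch below drops the fraction $γ$ of channels that are most correlated with a chosen anchor channel. This is only one possible reading of correlation-guided channel masking: the function name `correlated_channel_mask`, the anchor-based selection rule, and the toy correlation matrix are assumptions, not the paper's CWD definition.

```python
def correlated_channel_mask(corr, anchor, gamma):
    """Return a 0/1 keep-mask that drops the channels most correlated with `anchor`.

    corr  : square matrix (list of lists) of channel-to-channel correlations
    anchor: index of the channel whose correlated group is dropped
    gamma : fraction of channels to mask (0 <= gamma <= 1)
    """
    num_channels = len(corr)
    num_masked = int(gamma * num_channels)
    # Rank channels by correlation with the anchor, strongest first.
    ranked = sorted(
        range(num_channels), key=lambda c: corr[anchor][c], reverse=True
    )
    masked = set(ranked[:num_masked])
    return [0 if c in masked else 1 for c in range(num_channels)]


# Toy 4-channel correlation matrix (hypothetical values).
corr = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
mask = correlated_channel_mask(corr, anchor=0, gamma=0.5)
print(mask)  # [0, 0, 1, 1]
```

In this reading, a larger $γ$ removes more of the redundant, mutually correlated channels at once, which matches the observation that very high values such as 0.9 leave too little signal for the finest level.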

Analysis of Loss Weights: We also analyzed the weights for the three loss components. The results on the CUB dataset are shown in Table 12. The baseline model corresponds to $λ_{ce} = 1, λ_{hd} = 0, λ_{ig} = 0$ . Adding our proposed losses with a weight of 0.5 for all three components yielded the highest accuracy across all hierarchical levels and the best overall performance according to both $wAP$ and $mAP$ . The $F 1$ -score results are consistent with these findings, peaking at the same configuration. Increasing the weights of our proposed losses further did not lead to additional gains. Therefore, we set $λ_{ce} = λ_{hd} = λ_{ig} = 0.5$ in our experiments. We note that setting all weights to 1 and halving the learning rate produced similarly optimal results.

Table 12.

Analysis of $λ_{ce}$ , $λ_{hd}$ , and $λ_{ig}$ on the CUB-200-2011 Dataset.

			CUB-200-2011 (CUB)
$λ_{ce}$	$λ_{hd}$	$λ_{ig}$	$P_{1}$	$P_{2}$	$P_{3}$	$F 1$	$wAP$
0.5	0.5	0.5	0.992	0.964	0.878	0.854	0.897
0.5	0.5	1	0.991	0.959	0.867	0.843	0.887
0.5	1	0.5	0.990	0.959	0.870	0.846	0.889
0.5	1	1	0.989	0.959	0.870	0.846	0.889
1	0	0	0.987	0.955	0.866	0.842	0.886

The best results are shown in bold. wAP = weighted average precision.
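The remark that unit weights with a halved learning rate behave similarly can be checked with a short derivation. This sketch assumes that the total objective is the weighted sum of the three losses and that the parameters $θ$ are updated by plain SGD with learning rate $η$; the symbols $θ$ and $η$ and the plain-SGD assumption are introduced here for illustration only.

$$L = λ_{ce} L_{ce} + λ_{hd} L_{hd} + λ_{ig} L_{ig}, \qquad θ \leftarrow θ - η \nabla_{θ} L$$

With $λ_{ce} = λ_{hd} = λ_{ig} = 0.5$ and learning rate $η$:

$$θ \leftarrow θ - 0.5\,η \left( \nabla_{θ} L_{ce} + \nabla_{θ} L_{hd} + \nabla_{θ} L_{ig} \right)$$

With $λ_{ce} = λ_{hd} = λ_{ig} = 1$ and learning rate $η/2$:

$$θ \leftarrow θ - \tfrac{η}{2} \left( \nabla_{θ} L_{ce} + \nabla_{θ} L_{hd} + \nabla_{θ} L_{ig} \right)$$

The two updates coincide under these assumptions. With weight decay or an adaptive optimizer the match is only approximate, which is consistent with the two settings being reported as similarly optimal rather than identical.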

4.6.4. Computational Complexity Analysis

To evaluate the computational cost of our method, we analyzed the increase in the number of parameters and the training time per epoch compared to the HRN baseline. Our proposed loss functions, $L_{hd}$ and $L_{ig}$ , are themselves parameter-free. The main architectural components, including the backbone and the hierarchy-specific branches, are consistent with the baseline model, meaning our method introduces a negligible number of additional parameters.

The primary computational overhead stems from the calculation of the loss terms during the forward and backward passes. For $L_{hd}$ , this includes computing the channel correlation matrix for CWD and the hierarchical feature fusion. For $L_{ig}$ , it involves identifying in-group classes, sampling a distractor feature, and performing a forward pass over the small subset of in-group classes. We measured the training time on the CUB-200-2011 dataset with 100% fine-grained labels. Compared to the HRN baseline, our full method (HRN + $L_{hd}$ + $L_{ig}$ ) increased wall-clock training time per epoch by approximately 7% on an Nvidia GeForce RTX 3090 GPU. This modest overhead indicates that the accuracy improvements come at a practical training cost.

Regarding inference efficiency, as detailed in Section 4.2, all auxiliary branches and loss modules are removed during the testing phase. Thus, the inference latency and FLOPs are identical to the baseline backbone, incurring no additional computational cost for deployment.

4.6.5. Qualitative Results

To gain a more intuitive understanding of how our proposed loss functions affect the model’s decision-making process, we supplement the quantitative results with a qualitative analysis of the attention maps. As shown in Figure 5, we can observe distinct changes in the model’s focus as each component is added. The baseline HRN model, while generally focusing on the object, often produces diffuse attention maps that cover broad, less discriminative regions. This suggests that without explicit hierarchical guidance, the model may struggle to pinpoint the most informative parts for fine-grained distinction.

Upon introducing the hierarchically discriminative loss ( $L_{hd}$ ), the attention becomes significantly more concentrated and precise. The model learns to focus on key parts that are crucial for classification, such as the head and beak of the bird, which carry strong discriminative signals. This sharpened focus is a direct result of $L_{hd}$ enforcing consistency with the label hierarchy, effectively guiding the model to learn features that are not only relevant for the fine-grained category but also for its parent classes.

Finally, with the addition of the in-group regularization loss ( $L_{ig}$ ), the model’s attention, while still focused, expands to cover a more diverse set of complementary regions. For instance, in addition to the head, the model now pays more attention to other parts like the wing patterns or tail feathers. This behavior is consistent with the design of $L_{ig}$ , which encourages the model to seek out more subtle, unique features to overcome the confusion introduced by in-group distractors. By learning a more comprehensive and robust set of features, the model becomes less reliant on a single dominant part and is better equipped to handle the subtle interclass variations inherent in FGVC.

4.7. Comparison with Traditional FGVC Methods

To demonstrate the versatility of our proposed loss functions, we evaluated their effectiveness in a traditional (nonhierarchical) FGVC setting. For this, we treated the label hierarchy as having a single level (the finest granularity) and integrated our losses into two strong backbone architectures: ResNet-50 (He et al., 2016), representing traditional CNNs, and the more recent CSWin Transformer (Dong et al., 2022), representing hierarchical Vision Transformers. We then compared their performance against a range of SOTA methods on three standard benchmarks.

It is worth noting that while ViT-B (Han et al., 2022) is a popular Transformer baseline, we chose CSWin-L for our Transformer experiments because its hierarchical design is naturally aligned with our method’s objective of leveraging multiscale features. Standard ViT architectures process images as a flat sequence of patches with a constant resolution, lacking the inherent hierarchical feature maps that our $L_{hd}$ and $L_{ig}$ losses are designed to exploit efficiently. CSWin Transformer, on the other hand, produces hierarchical feature representations similar to CNNs (e.g., ResNet), making it a more suitable and powerful candidate for demonstrating the generalizability of our hierarchical loss functions.

As shown in Table 13, simply adding our loss functions boosts the performance of both backbones, achieving results that are competitive with or superior to recent, more complex models. For instance, when applied to ResNet-50, our method achieves the highest accuracy on the AIR dataset (0.947) and comes within 0.001 of the best result on the CAR dataset (0.951 versus 0.952 for CHRF), outperforming even recent Transformer-based methods like MCG (Yang et al., 2024) on both datasets. This is particularly notable because our approach achieves these results without requiring complex network modifications or extra annotations (such as the human gaze data used by CHRF; Liu et al., 2022), highlighting the power of leveraging the inherent (even if simple) label hierarchy. The performance gains are also evident on the stronger CSWin-L backbone, where our method improves the baseline on all three datasets, demonstrating that our approach is not limited to CNNs and can effectively enhance Transformer-based models as well. This underscores the model-agnostic nature and broad applicability of our proposed hierarchical losses.

Table 13.
Performance Comparison on Three Fine-Grained Visual Classification Benchmarks.

		Accuracy/ $F 1$ -Score
Method	Backbone	CUB	AIR	CAR
MC-Loss (Chang et al., 2020)	ResNet50	0.873/0.849	0.926/0.899	0.937/0.916
SPS (Huang et al., 2021)	ResNet50	0.887/0.862	0.927/0.900	0.949/0.928
Chang et al. (Chang et al., 2021)	ResNet50	0.856/0.832	0.920/0.894	0.937/0.916
M2B (Liang et al., 2022)	ResNet50	0.882/0.858	0.942/0.915	0.929/0.908
CHRF (Liu et al., 2022)	ResNet50	0.892/0.867	0.936/0.909	0.952/0.931
HRN (Chen et al., 2022)	ResNet50	0.866/0.842	0.929/0.902	0.936/0.915
TransFG (He et al., 2022)	ViT-B	0.917/0.892	-	0.948/0.927
RAMS-Trans (Hu et al., 2021)	ViT-B	0.913/0.888	-	-
FFVT (Wang et al., 2021)	ViT-B	0.916/0.891	-	-
DCAL (Zhu et al., 2022)	ViT-B	0.914/0.889	0.915/0.889	0.934/0.913
MCG (Yang et al., 2024)	ViT-B	0.918/0.893	0.939/0.912	0.949/0.928
Ours	ResNet50	0.891/0.866	0.947/0.920	0.951/0.930
Baseline	CSWin-L	0.923/0.898	0.908/0.882	0.940/0.919
Ours	CSWin-L	0.926/0.901	0.921/0.895	0.947/0.926

The best results are shown in bold.
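The size of the improvement on the CSWin-L backbone can be read off Table 13 with a few lines of arithmetic. The snippet below is only a convenience sketch; the helper name `gain_in_points` is hypothetical, and the accuracy values are copied from the two CSWin-L rows of the table.

```python
# Accuracy values copied from the CSWin-L rows of Table 13.
baseline = {"CUB": 0.923, "AIR": 0.908, "CAR": 0.940}
ours = {"CUB": 0.926, "AIR": 0.921, "CAR": 0.947}


def gain_in_points(before, after):
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return {
        name: round((after[name] - before[name]) * 100, 1) for name in before
    }


gains = gain_in_points(baseline, ours)
print(gains)  # {'CUB': 0.3, 'AIR': 1.3, 'CAR': 0.7}
```

The gain is largest on AIR, where the CSWin-L baseline is weakest, and smallest on CUB, where the baseline is already strong.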

4.8. Visualization

To visually interpret the effects of our loss functions, we used Grad-CAM (Selvaraju et al., 2017) to generate attention map visualizations, shown in Figure 5. The baseline HRN model often focuses on broad, common features. The addition of our hierarchically discriminative loss ( $L_{hd}$ ) sharpens the model’s focus, directing its attention to the most discriminative object parts (e.g., the bird’s head and beak), which are critical for fine-grained distinction. This demonstrates that $L_{hd}$ effectively guides the model to learn features that align with the label hierarchy. Furthermore, adding the in-group regularization loss ( $L_{ig}$ ) encourages the model to discover a more diverse set of complementary features. This is visible as a more diffuse attention response that covers additional object parts, leading to a more comprehensive and robust object representation.
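For readers unfamiliar with Grad-CAM, its core combination step can be sketched in a few lines: each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, and a ReLU keeps only positively contributing regions. The function name `grad_cam` and the toy inputs are illustrative; extracting activations and gradients from a trained network, and upsampling the result to image size, are left out.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heat map.

    activations, gradients: lists of K feature maps, each H x W (nested lists),
    where gradients[k] holds d(class score) / d(activations[k]).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of that channel's gradient map.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions that raise the class score.
    return [[max(0.0, value) for value in row] for row in cam]


# Toy input: two channels, each a 1 x 2 feature map (hypothetical values).
activations = [[[1.0, 2.0]], [[3.0, 0.0]]]
gradients = [[[1.0, 1.0]], [[-1.0, -1.0]]]
heat_map = grad_cam(activations, gradients)
print(heat_map)  # [[0.0, 2.0]]
```

In the toy input the second channel has a negative weight, so the location where it dominates is suppressed to zero, while the location driven by the first channel remains highlighted.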

5. Discussion

While our proposed method has demonstrated strong performance, it is important to acknowledge its limitations and consider avenues for future research. One potential limitation is the reliance on a predefined, clean label hierarchy. In real-world scenarios, such hierarchies may be noisy, incomplete, or not available at all. The performance of our method could be affected by the quality of the hierarchy, and future work could explore methods to automatically learn or refine hierarchies from data.

Another aspect to consider is the scalability of the approach to datasets with extremely deep or large-scale hierarchies. While our experiments on the four-level Butterfly-200 dataset show promising results, the computational complexity, especially for the in-group regularization, might increase with the number of classes at each level. Investigating more efficient sampling strategies or alternative regularization techniques for large-scale hierarchies would be a valuable direction.

Furthermore, our hierarchical framework is built upon the CNN-based HRN architecture. Although Section 4.7 shows that our losses also improve a CSWin-L backbone in the single-level setting, a fuller integration of our loss functions with Transformer-based backbones under multilevel label hierarchies remains a promising future direction. The global attention mechanism in Transformers may interact with our hierarchical losses in interesting ways, potentially leading to further performance gains.

Finally, the core ideas of hierarchical feature discrimination and in-group regularization could be extended beyond classification. Future work could explore their application in other fine-grained tasks, such as retrieval or zero-shot learning, where leveraging label hierarchies can provide crucial semantic constraints. We believe that explicitly modeling hierarchical relationships is a key direction for advancing fine-grained visual understanding.

Future work will focus on extending this approach to deeper label hierarchies and a wider range of application domains. We will also explore its integration with semisupervised and active learning paradigms, its generalization to Transformer and multimodal architectures, and the development of a stronger theoretical foundation for hierarchical fine-grained learning.

6. Conclusion

In this paper, we introduced a dual-pronged approach to address the key challenges of out-of-group and in-group errors in FGVC, particularly in scenarios involving label hierarchies. Our primary contribution is the proposal of two novel, complementary loss functions: the hierarchically discriminative loss ( $L_{hd}$ ) and the in-group regularization loss ( $L_{ig}$ ). The $L_{hd}$ loss effectively leverages the coarse-to-fine label structure by encouraging fine-grained features to inherit attributes from their ancestral classes. This hierarchical feature fusion enforces a semantic constraint, improving the model’s ability to distinguish between categories from different superclasses and thereby reducing out-of-group errors. Concurrently, the $L_{ig}$ loss tackles the more challenging task of separating visually similar in-group categories. By introducing controlled noise through feature mixing with confusing in-group distractors, it regularizes the feature space, pushing the model to learn more robust and truly discriminative representations, thus mitigating in-group errors and reducing overfitting.

A key advantage of our method is its simplicity and ease of integration. The proposed loss functions can be readily incorporated into existing multilevel classification frameworks for end-to-end training without requiring any complex architectural modifications or additional data annotations, such as bounding boxes or part labels. Our extensive experiments, conducted on five diverse and challenging benchmark datasets, demonstrated that this approach consistently improves classification accuracy across various metrics. The performance gains were particularly significant in challenging low-data regimes, where only a small fraction of training samples had complete fine-grained labels, highlighting the strong regularization effect and data efficiency of our method. The ablation studies and attention map visualizations further validated our design, confirming that the two loss functions work synergistically to guide the model toward a more accurate and robust understanding of fine-grained categories. We believe this work provides a valuable and effective strategy for leveraging label hierarchies in FGVC.

Footnotes

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which have helped to improve the quality of this manuscript.

Ethical Considerations

This article does not contain any studies with human or animal participants.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

The datasets used in this study are publicly available: CUB-200-2011 (CUB), Butterfly-200, FGVC-Aircraft (AIR), Stanford Cars (CAR), and BreakHis. The specific hierarchical label mappings generated for this work are available from the corresponding author upon reasonable request.

References

1. Atban, Ekinci, Garip (2023). Traditional machine learning algorithms for breast cancer image classification with optimized deep features. Biomedical Signal Processing and Control, 81, 104534.

2. Cerri, Barros R. C., PLF de Carvalho A. C., Jin (2016). Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformatics, 17(1), 373.

3. Chang, Ding, Xie, Bhunia A. K., Guo, Song Y. Z. (2020). The devil is in the channels: Mutual-channel loss for fine-grained image classification. IEEE Transactions on Image Processing, 29, 4683–4695.

4. Chang, Pang, Zheng, Song Y. Z., Guo (2021). Your “flamingo” is my “bird”: Fine-grained, or not. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11476–11485).

5. Chang Y. T., Cheng W. H., Hua K. L. (2017). Fashion world map: Understanding cities through streetwear fashion. In Proceedings of the 25th ACM international conference on multimedia (pp. 91–99).

6. Chen, Wang, Liu, Qian (2022). Label relation graphs enhanced hierarchical residual network for hierarchical multi-granularity classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4858–4867).

7. Chen, Weng S. E., Peng C. J., Shuai H. H., Cheng W. H. (2021). ZYELL-NCTU NetTraffic-1.0: A large-scale dataset for real-world network anomaly detection. In 2021 IEEE international conference on consumer electronics-Taiwan (ICCE-TW) (pp. 1–2). IEEE.

8. Chen, Gao, Dong, Luo, Lin (2018). Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In Proceedings of the 26th ACM international conference on multimedia (pp. 2023–2031).

9. Deng, Dong, Socher, Li L. J., Fei-Fei (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.

10. Dong, Bao, Chen, Zhang, Yuan, Chen, Guo (2022). CSWin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12124–12134).

11. Doyle, Agner, Madabhushi, Feldman, Tomaszewski (2008). Automated grading of breast cancer histopathology using spectral clustering with textural and architectural image features. In 2008 5th IEEE international symposium on biomedical imaging: From nano to macro (pp. 496–499). IEEE.

12. Gandomkar, Brennan P. C., Mello-Thoms (2018). MuDeRN: Multi-category classification of breast histopathological image using deep residual networks. Artificial Intelligence in Medicine, 88, 14–24.

13. Giunchiglia, Lukasiewicz (2020). Coherent hierarchical multi-label classification networks. Advances in Neural Information Processing Systems, 33, 9662–9673.

14. Han, Wang, Chen, Guo, Liu, Tang, Xiao, Yang, Zhang, Tao (2022). A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 87–110.

15. Han, Wei, Zheng, Yin (2017). Breast cancer multi-classification from histopathological images with structured deep learning model. Scientific Reports, 7(1), 4172.

16. He, Chen J. N., Liu, Kortylewski, Yang, Bai, Wang (2022). TransFG: A transformer architecture for fine-grained recognition. In Proceedings of the AAAI conference on artificial intelligence (Vol. 36, pp. 852–860).

17. He, Zhang, Ren, Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

18. Hou, Wang, Chen, Xue J. H., Zhu, Yang (2019). Spatial–temporal attention Res-TCN for skeleton-based dynamic hand gesture recognition. In L. Leal-Taixé & S. Roth (Eds.), Computer vision – ECCV 2018 workshops (pp. 273–286). Springer International Publishing.

19. Hou (2020). Breast cancer pathological image classification based on deep learning. Journal of X-ray Science and Technology, 28(4), 727–738.

20. Hu, Huang (2019). See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891.

21. Jiang, Liu, Luo, Cao, Zhang (2025). Hierarchical self-distilled feature learning for fine-grained visual categorization. IEEE Transactions on Neural Networks and Learning Systems, 36(3), 4005–4018.

22. Hu, Jin, Zhang, Hong, Zhang, Xue (2021). RAMS-Trans: Recurrent attention multi-scale transformer for fine-grained image recognition. In Proceedings of the 29th ACM international conference on multimedia (pp. 4239–4248).

23. Huang, Wang, Tao (2021). Stochastic partial swap: Enhanced model generalization and interpretability for fine-grained recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 620–629).

24. Iqbal, Qureshi A. N., Aurangzeb, Alhussein, Anwar M. S., Zhang, Syed (2024). Adaptive magnification network for precise tumor analysis in histopathological images. Computers in Human Behavior, 156, 108222.

25. Krause, Stark, Deng, Fei-Fei (2013). 3D object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops (pp. 554–561).

26. Li, Zhu, Zhang (2023). Pathological image classification via embedded fusion mutual learning. Biomedical Signal Processing and Control, 79, 104181.

27. Li, Pan, Yang, Liu, Fan, Cao, Zhang (2020). Multi-task deep learning for fine-grained classification and grading in breast cancer histopathological images. Multimedia Tools and Applications, 79(21), 14509–14528.

28. Liang, Zhu, Wang, Yang (2022). Penalizing the hard example but not too much: A strong baseline for fine-grained visual classification. IEEE Transactions on Neural Networks and Learning Systems, 35(5), 7048–7059.

29. Liu, Juhas, Zhang (2020). Fine-grained breast cancer classification with bilinear convolutional neural networks (BCNNs). Frontiers in Genetics, 11, 547327.

30. Liu, Zhou, Zhang, Bai, Zhou, Hancock E. R. (2022). Where to focus: Investigating hierarchical attention relationship for fine-grained visual classification. In European conference on computer vision (pp. 57–73). Springer.

31. Lo, Ruan B. K., Shuai H. H., Cheng W. H. (2023). Modeling uncertainty for low-resolution facial expression recognition. IEEE Transactions on Affective Computing, 15(1), 198–209.

32. Maji, Rahtu, Kannala, Blaschko, Vedaldi (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

33. Selvan P. S., Addula S. R., Singh C. E., Narayanaperumal, Marriwala N. K., Appathurai (2025). Deep learning-enabled fetal health classification through sensor-fused IoT environment. In N. K. Marriwala, V. K. Shukla, S. Jain, D. Kumar, & S. Dhingra (Eds.), Mobile radio communications and 5G Networks (pp. 157–169). Springer Nature Singapore.

34. Selvaraju R. R., Cogswell, Das, Vedantam, Parikh, Batra (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (pp. 618–626).

35. Spanhol F. A., Oliveira L. S., Petitjean, Heutte (2015). A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7), 1455–1462.

36. Szegedy, Vanhoucke, Ioffe, Shlens, Wojna (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).

37. Tian, Wang, Chen, Luo (2023). Diagnose like doctors: Weakly supervised fine-grained classification of breast cancer. ACM Transactions on Intelligent Systems and Technology, 14(2), 1–17.

38. Tompson, Goroshin, Jain, LeCun, Bregler (2015). Efficient object localization using convolutional networks. In 2015 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 648–656). https://doi.org/10.1109/CVPR.2015.7298664

39. Wah, Branson, Welinder, Perona, Belongie (2011). The Caltech-UCSD Birds-200-2011 dataset.

40. Wang, Gao (2021). Feature fusion vision transformer for fine-grained visual categorization. arXiv preprint arXiv:2107.02341.

41. Wehrmann, Cerri, Barros (2018). Hierarchical multi-label classification networks. In International conference on machine learning (pp. 5075–5084). PMLR.

42. Wei X. S., Xie C. W., Shen (2018). Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76, 704–714.

43. Yang, Jin, Lei, Zhang (2024). Multi-directional guidance network for fine-grained visual classification. The Visual Computer, 40(11), 8113–8124.

44. Yang, Luo, Wang, Gao, Wang (2018). Learning to navigate for fine-grained classification. In Proceedings of the European conference on computer vision (ECCV) (pp. 420–435).

45. Yu, Fang, Jiang (2025). Alleviating category confusion in fine-grained visual classification. The Visual Computer, 41, 1–16.

46. Zhang, Cisse, Dauphin Y. N., Lopez-Paz (2018). MixUp: Beyond empirical risk minimization. https://arxiv.org/abs/1710.09412

47. Zhang, Donahue, Girshick, Darrell (2014). Part-based R-CNNs for fine-grained category detection. In European conference on computer vision (pp. 834–849). Springer.

48. Zheng, Mei, Luo (2017). Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE international conference on computer vision (pp. 5209–5217).

49. Zhu, Liu, Tian, Shan (2022). Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4692–4702).

	Accuracy/ $F 1$ -Score
Method	40 $\times$	100 $\times$	200 $\times$	400 $\times$
ResNet-50 (He et al., 2016)	0.940/0.934	0.958/0.951	0.927/0.920	0.938/0.932
Inception V3 (Szegedy et al., 2016)	0.902/0.896	0.856/0.848	0.861/0.854	0.825/0.819
22-Layer CNN (Hou, 2020)	0.909/0.902	0.900/0.893	0.910/0.905	0.910/0.904
CSD-CNN (Han et al., 2017)	0.928/0.921	0.939/0.934	0.937/0.930	0.929/0.924
MT-FT-Xception (Li et al., 2020)	0.948/0.943	0.940/0.933	0.939/0.931	0.907/0.900
MuDeRN (Gandomkar et al., 2018)	0.956/0.949	0.949/0.944	0.957/0.950	0.946/0.941
NTS-Net (Yang et al., 2018)	0.967/0.960	0.952/0.947	0.934/0.929	0.912/0.905
WS-DAN (Hu et al., 2019)	0.950/0.945	0.962/0.955	0.920/0.913	0.901/0.896
BCNNs (Liu et al., 2020)	0.960/0.953	0.958/0.951	0.947/0.942	0.945/0.940
DLD-Net (Tian et al., 2023)	0.973/0.966	0.974/0.969	0.940/0.933	0.963/0.958
SVM(RBF)-ASO (Atban et al., 2023)	0.961/0.954	0.943/0.936	0.973/0.966	0.963/0.956
EFML (Li et al., 2023)	0.968/0.963	0.974/0.969	0.967/0.960	0.964/0.959
MagNet (Iqbal et al., 2024)	0.974/0.969	0.941/0.934	0.975/0.968	0.981/0.976
MCG (Yang et al., 2024)	0.975/0.970	0.978/0.973	0.976/0.971	0.980/0.975
SFME (Yu et al., 2025)	0.978/0.973	0.975/0.970	0.979/0.974	0.982/0.977
Ours	0.984/0.980	0.984/0.979	0.983/0.978	0.984/0.981

			Accuracy/ $F 1$ -Score
Baseline	$L_{hd}$	$L_{ig}$	CUB	AIR	CAR
✓			0.865/0.841	0.926/0.898	0.936/0.915
✓	✓		0.869/0.845	0.937/0.910	0.943/0.923
✓		✓	0.869/0.844	0.929/0.902	0.942/0.922
✓	✓	✓	0.878/0.854	0.938/0.913	0.948/0.929
