Feature evolvable learning with image streams

Abstract

Feature Evolvable Stream Learning (FESL), in which old features may vanish and new features may appear while learning from streaming data, has received extensive attention in the past few years. Existing FESL algorithms are mainly designed for simple datasets with low-dimensional features and are ineffective on complex streams such as image sequences. The crux lies in two facts: (1) the shallow model, which is assumed to be adequate for low-dimensional streams, fails to capture the complex nonlinear patterns of images, and (2) the linear mapping used to recover the vanished features from the new ones is inadequate for reconstructing the old features of image streams. In response, this paper explores a new online learning paradigm, Feature Evolvable Learning with Image Streams (FELIS), which aims to make online learners less restrictive and more widely applicable. In particular, we present a novel ensemble residual network (ERN), in which the prediction is a weighted combination of classifiers learned on the feature representations from several residual blocks, so that learning can start with a shallow network that enjoys fast convergence and then gradually switch to a deeper model as more data is received and more complex hypotheses can be learned. Moreover, we modify the first residual block of ERN into an autoencoder and propose a latent representation mapping (LRM) approach that exploits the relationship between the previous and current feature spaces of the image streams by minimizing the discrepancy between the latent representations from the two feature spaces. We carried out experiments on both virtual and real scenarios over large-scale images, and the experimental results demonstrate the effectiveness of the proposed method.

Keywords

Feature evolvable stream learning (FESL); feature evolvable learning with image streams (FELIS); ensemble residual network (ERN); latent representation mapping (LRM)

1. Introduction

In many practical applications, data arrive in a streaming fashion, and hence Feature Evolvable Stream Learning (FESL), which aims to tackle the streaming data problem, has attracted considerable attention in recent years [1, 2, 3, 4, 5]. Specifically, in an open and dynamic environment, data instances arrive at different time steps, where previous features vanish and new features appear. For instance, when we deploy surveillance cameras in an ecosystem to collect data, the parameters of the cameras change whenever a device is replaced. As a result, the features corresponding to the previous cameras vanish while the features corresponding to the new cameras emerge. A straightforward approach to this task is to learn a new model on the new features. However, this solution suffers from the lack of new samples and wastes the vanished features.

In order to take advantage of the historical information from the previous feature space, it is essential to bridge the gap between the previous and the new feature spaces. The pioneering work on this idea is presented in [2], whose crucial observation is that features generally do not change in an arbitrary way; instead, there are evolving periods in which both old and new features are available. In the camera example, since the service cycle of a camera is defined in advance, we can deploy the new device before the old one becomes completely obsolete, and consequently there exists an overlapping stage that allows us to obtain two data streams, one collected by the old camera and one by the new camera. Accordingly, [2] proposes to establish the relationship between the previous and current feature spaces by learning a mapping matrix from the data of the evolving stage. By mapping the new data onto the previous feature space, the historical data or model can be exploited when learning a new model in the new feature space. Since then, a line of research on learning from feature evolvable streams has been explored [6, 7, 3].

However, the existing FESL methods can only be applied to simple tasks in which the data streams have low-dimensional features. They are usually impractical for high-dimensional data streams such as image sequences, for two reasons. First, in an online learning setting, the target classifier is usually a shallow model (e.g., a linear or kernel-based hypothesis), which fails to learn the complex patterns of images. Second, the existing methods employ a linear function to establish the mapping between the old and new feature spaces, which may not be capable of capturing the complex relationship between the two feature spaces of image streams.

To fill this gap, this paper explores a new paradigm, Feature Evolvable Learning with Image Streams (FELIS), which relaxes the constraint that the data instances have low-dimensional features. To achieve this, we need to address the following questions.

How to choose a proper model capacity (e.g., network depth) for the classifier to learn the complex patterns of images. The challenge is that if the model is too complex, the learning process converges with difficulty on the few instances that have arrived in the initial rounds, whereas if the model is too simple, its learning capability becomes a bottleneck as more instances are obtained.

How to exploit the relationship between the old and new feature spaces of the image streams. If we apply a simple linear function to estimate the mapping, as the existing methods do, the complex relationship between the feature spaces cannot be adequately represented. If we apply a complex nonlinear function instead, a large number of parameters must be estimated. These parameters are difficult to estimate in the online learning setting, and the resulting mapping is prone to overfitting.

In response, we design a solution to the two challenges in a unified framework. First, inspired by the idea of sequential decision making [8], we design a novel ensemble residual network (ERN) to learn the complex patterns of image streams in an online setting. ERN is constructed by composing several classifiers, each attached to one residual block. The final prediction of the target model is a weighted combination of the predictions of all classifiers. With the hedge backpropagation (HBP) strategy, our framework can dynamically select the model capacity and encourages knowledge to transfer from shallow to deeper networks. Second, instead of using a parametric method to estimate the mapping between the old and new features, we propose a novel approach, termed latent representation mapping (LRM), to learn the relationship between the feature spaces, where the mapping is obtained by aligning the latent image representations derived from an integrated autoencoder. Overall, our contributions are summarized as follows.

This is the first work to explore the FELIS problem, in which the learner aims to learn from high-dimensional streams with evolvable feature spaces.

We design an ERN framework that allows us to train deep neural networks (DNNs) of adaptive capacity while enabling knowledge sharing between shallow and deep networks.

We propose an LRM approach to learn the complex relationship between the old and current feature spaces, which is effective and parameter-free.

It is worth emphasizing that our method differs from the work proposed in [9]. On the one hand, the basic blocks of the network in [9] are fully connected layers, which are ineffective at extracting abstract representations of image streams. Our ERN architecture employs residual blocks as the basic blocks, which are well suited to image processing tasks. On the other hand, the goal of [9] is to learn a DNN on the fly under the assumption that the data streams have fixed features, whereas our method aims to tackle the challenge of FELIS, in which the features of the streams evolve over time.

2. Related work

2.1 Online learning

Online learning is a family of scalable algorithms that update models sequentially from data streams [10, 11]. Popular algorithms include Perceptron [12], Online Gradient Descent [13], Passive Aggressive [14], Online Learning with Kernels [15], Online Multiple Kernel Learning [16, 17], etc. However, these models are too weak to deal with complex tasks. DNNs are state-of-the-art methods for many complex learning tasks due to their ability to extract increasingly better features at each network layer [18, 19, 20]. However, the improved performance brought by additional layers in a deep network comes at the cost of added latency and energy usage in feedforward inference. [21] propose BranchyNet, which allows the predictions for a large portion of test samples to exit the network early via side branches in batch learning. There have been attempts to make deep learning compatible with online learning via a sliding-window approach with a (mini-)batch training stage [22, 23, 24]. These approaches optimize the loss function based on the output of the deepest layer, which is suitable only for batch settings. Similar to BranchyNet, in the online setting, [9] propose Hedge Backpropagation, which evaluates the performance of every output classifier at each online round using Hedge [25] and extends backpropagation to allow the DNN capacity to vary dynamically.

2.2 Feature evolving stream learning

Online learning has been extensively studied under different settings, such as learning with experts [10] and online convex optimization [26, 27]. There are strong theoretical guarantees for online learning, which usually uses the regret or the number of mistakes to measure the performance of the learning procedure. However, most existing online learning algorithms are limited to the case in which the feature set is fixed. As data are usually collected from dynamic environments, it is of great importance to equip the learning system with the capability of dealing with environmental changes. To deal with data streams with an evolving feature space, recent studies propose to exploit the relationship between the previous feature space and the current one, so that historical data can be further leveraged. [1] allow the arriving instances to carry different sets of features, but later instances are assumed to include monotonically more features than earlier ones. [2] learn a linear mapping from the evolving stage to recover the previous features and then ensemble two linear models learned from the recovered features and the current ones; they propose two methods, called FESL-c and FESL-s, to improve the performance of learning with streaming data. [6] propose an Evolving Discrepancy Minimization algorithm and use two 5-layer MLP classifiers to address FDESL, in which both the feature space and the data distribution evolve in the streaming scenario.

2.3 The motivations of our work

Supporting variable features and high-dimensional streams are two important aspects of online machine learning. As shown in Section 2.2, the existing FESL methods aim to solve the problem of feature drifting, but they are developed for simple tasks such as classifying streams with low-dimensional features. They fail to process streams with high-dimensional features effectively online because of their limited capability to represent complex data patterns. Although some online learning methods illustrated in Section 2.1 (e.g., BranchyNet and Hedge Backpropagation) bring DNN capacity to many complex learning tasks, they cannot deal with the variable features of the streams and they increase latency and energy usage in the inference procedure. To fill this gap, we explore a new online learning paradigm, FELIS, which aims to make online learners less restrictive and more widely applicable. Achieving this, however, is not a trivial combination of the two families of methods. The main challenges of solving FELIS are 1) how to choose a proper model capacity to learn the image streams online, and 2) how to handle the feature drifting between the different feature spaces effectively and efficiently.

In response, we design a unified framework to tackle the two challenges. First, we design an ERN to learn the complex patterns of the image streams, which provides an effective model selection approach that adapts to the optimal network depth automatically online. Second, we propose an LRM approach to obtain the relationship between the old and new feature spaces, which is effective and parameter-free.

3. The FELIS problem

3.1 Problem statement

We focus on image classification tasks in the setting of FELIS. That is, in each round of the learning process, the model receives an image instance and then predicts its label before the true label is revealed. Considering the example of images collected from surveillance cameras, we denote by $S_{1}$ and $S_{2}$ the feature spaces of the images captured by the previous and the current camera, respectively. As a special case of FESL [2], FELIS can also be divided into three ordered periods: the pre-overlapping period, the overlapping period and the post-overlapping period. As shown in Fig. 1, in the pre-overlapping period, a number of image streams come from $S_{1}$; then, in the overlapping period, some image streams come from both $S_{1}$ and $S_{2}$; subsequently, in the post-overlapping period, image streams come only from $S_{2}$. Suppose that there are $B$ rounds of instances from both $S_{1}$ and $S_{2}$ in the overlapping period. The three periods are described in detail as follows.

Figure 1.

Illustration of how image stream comes.

Pre-overlapping period: For $t=1,\ldots,T_{1}-B$, in each round, the learner receives an image instance $\mathbf{x}_{t}^{S_{1}}\in\mathbb{R}^{d_{1}}$ sampled from $S_{1}$, in which $d_{1}$ is the number of features of $S_{1}$ and $T_{1}$ is the number of rounds in $S_{1}$.

Overlapping period: For $t=T_{1}-B+1,\ldots,T_{1}$, in each round, the learner receives two image instances $\mathbf{x}_{t}^{S_{1}}\in\mathbb{R}^{d_{1}}$ and $\mathbf{x}_{t}^{S_{2}}\in\mathbb{R}^{d_{2}}$ sampled from $S_{1}$ and $S_{2}$, in which $d_{2}$ is the number of features of $S_{2}$.

Post-overlapping period: For $t=T_{1}+1,\ldots,T_{1}+T_{2}$, in each round, the learner receives an image instance $\mathbf{x}_{t}^{S_{2}}\in\mathbb{R}^{d_{2}}$ sampled from $S_{2}$, in which $T_{2}$ is the number of rounds in $S_{2}$.
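The round boundaries of the three periods can be sketched in a few lines of Python. This is only an illustration of the index ranges stated above; the function names `period_of` and `spaces_observed` are hypothetical and do not come from the original work.

```python
def period_of(t: int, T1: int, T2: int, B: int) -> str:
    """Return the period that round t (1-indexed) belongs to within one cycle."""
    if not 1 <= t <= T1 + T2:
        raise ValueError("t must lie in 1..T1+T2")
    if t <= T1 - B:
        return "pre-overlapping"   # only x_t^{S1} arrives
    if t <= T1:
        return "overlapping"       # both x_t^{S1} and x_t^{S2} arrive
    return "post-overlapping"      # only x_t^{S2} arrives


def spaces_observed(t: int, T1: int, T2: int, B: int) -> tuple:
    """Return the feature spaces from which instances arrive in round t."""
    table = {
        "pre-overlapping": ("S1",),
        "overlapping": ("S1", "S2"),
        "post-overlapping": ("S2",),
    }
    return table[period_of(t, T1, T2, B)]
```

For example, with $T_{1}=10$, $T_{2}=5$ and $B=3$, rounds 1 to 7 are pre-overlapping, rounds 8 to 10 are overlapping, and rounds 11 to 15 are post-overlapping.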

Note that the pre-overlapping, overlapping and post-overlapping periods together form one cycle, which repeats continuously. The post-overlapping period of the current cycle can be regarded as the pre-overlapping period of the next cycle. Our goal is to learn a series of models $\mathrm{F_{1}},\mathrm{F_{2}},\ldots,\mathrm{F_{T_{1}+T_{2}}}$ on the image streams that provide accurate predictions by minimizing the empirical risk. This goal can be defined as:

$\displaystyle R(\mathrm{F})=\frac{1}{T_{1}+T_{2}}\sum_{t=1}^{T_{1}+T_{2}}\mathcal{L}(y_{t},\mathrm{F_{t}}(\mathbf{x}_{t})),$ (1)

where $y_{t}$ is the true label and $\mathcal{L}(\cdot,\cdot)$ is an operator that measures the distance between two probability vectors and thereby defines the empirical risk. Cross-entropy and Mean Square Error (MSE) are two common choices for constructing this loss.
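As a minimal sketch of Eq. (1), the following Python snippet averages a per-round loss over a sequence of label and prediction vectors. It assumes one-hot label vectors and probability-vector predictions; the helper names and the clipping constant `eps` are illustrative assumptions rather than part of the original method.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a label distribution and a predicted distribution."""
    # eps guards against log(0); it is an implementation assumption.
    return -sum(p * math.log(max(q, eps)) for p, q in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error between two probability vectors."""
    return sum((p - q) ** 2 for p, q in zip(y_true, y_pred)) / len(y_true)


def empirical_risk(labels, predictions, loss=cross_entropy):
    """Average per-round loss over all rounds, as in Eq. (1)."""
    return sum(loss(y, p) for y, p in zip(labels, predictions)) / len(labels)
```

Here `labels[t]` plays the role of $y_{t}$ and `predictions[t]` the role of $\mathrm{F_{t}}(\mathbf{x}_{t})$.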

3.2 Challenges and our thoughts

From the problem statement, we observe two challenges in solving the FELIS problem, which are described as follows.

Challenge 1: Dilemma of determining the model capacity (e.g., network depth). The existing methods usually use a shallow $\mathrm{F}$, based on a linear or kernel-based hypothesis, to solve FESL problems for low-dimensional streams (e.g., the Australian dataset). However, such models fail to learn the complex patterns of high-dimensional features from image streams in the FELIS setting [28]. A plausible solution is to directly use a complex $\mathrm{F}$, such as a DNN model, to learn the complex patterns of image streams [29, 30]. Unfortunately, these models are also infeasible for FELIS tasks, in which the image instances arrive over time. This is because the complex structure of DNNs makes the learning process converge too slowly when each round provides only a single instance, which loses the desired property of online learning. Therefore, the FELIS setting requires a well-designed network that converges fast in the initial stages and achieves deep representations at a later stage.

Challenge 2: Difficulty of choosing a proper mapping. In the FELIS setting, the evolving stage $B$ is very short while the feature dimensions of the arriving instances are high. On the one hand, the linear mapping functions used by the existing FESL approaches for low-dimensional data streams are unable to recover the complex relationship between the feature spaces of the vanished images $\mathbf{x}_{t}^{S_{1}}$ and the newly arriving images $\mathbf{x}_{t}^{S_{2}}$. On the other hand, if we directly apply a complex nonlinear mapping to FELIS tasks, an enormous number of parameters is inevitably needed to express the mapping. These parameters cannot be estimated reliably because of the scarcity of instances in each round, and the learned mapping is prone to overfitting. Accordingly, an appropriate mapping approach for our setting is needed.

Our Ideas. Our response to the two challenges is two-fold. First, we design an ERN to dynamically choose the capacity of the model while encouraging the learner to transfer knowledge from the shallower layers to the deeper layers. Specifically, we employ several residual blocks [19] as basic components for extracting image features. We then attach an output classifier to each block and use a Hedge Backpropagation approach [9] to update the network in the online setting, which allows knowledge to be shared between shallow and deep networks.

Second, we propose an LRM method to effectively estimate the relationship between the feature spaces. Instead of exploiting the feature correlations on the raw representations of the data, we estimate the feature relationships based on the latent representations learned by an embedded autoencoder during the overlapping period. The proposed strategy computes the mapping between different feature spaces in a parameter-free way, and hence provides an appropriate solution for FELIS problems.

4. The proposed approach

Our proposal for solving the FELIS problem is composed of an ensemble residual network (ERN) and a latent representation mapping (LRM) strategy, where the ERN is used to learn the image patterns in an online way and the LRM is applied to handle the feature evolving issue. The implementation of our method, including ERN and LRM, is outlined in Algorithm 1.

Algorithm 1.

The Proposed Approach

Inputs: Hedge parameter: $\beta\in(0,1)$; Learning rate parameter: $\eta$; Smoothing parameter: $s$; Blocks parameter: $L$; Tradeoff parameter: $\lambda$

Initialize: $\mathrm{F}_{t}(\mathbf{x})=$ Residual Network with $L+1$ blocks $\mathrm{h}_{t}^{(l)}$ and $L+1$ classifiers $g_{t}^{(l)}$; $\alpha^{(l)}=\frac{1}{L+1},\forall l=0,\ldots,L$

For $t=1,2,\ldots,T_{1}+T_{2}$ do

Receive instance: $\mathbf{x}_{t}$;

Predict $\hat{y}_{t}=\mathrm{F}_{t}(\mathbf{x}_{t})=\sum_{l=0}^{L}\alpha_{t}^{(l)}g_{t}^{(l)}$ as per Eq. (2);

Reveal $y_{t}$;

Set $\mathcal{L}^{\textit{ERN}}=\sum\nolimits_{l=0}^{L}\alpha^{(l)}\mathcal{L}(g_{t}^{(l)},y_{t})$;

Update $\Theta_{t+1}^{(l)},\forall l=0,\ldots,L$ as per Eq. (5);

Update $W_{t+1}^{(l)},\forall l=1,\ldots,L$ as per Eq. (6);

Update $\alpha_{t+1}^{(l)}=\alpha_{t}^{(l)}\beta^{\mathcal{L}(g_{t}^{(l)},y_{t})},\forall l=0,\ldots,L$;

Smoothing $\alpha_{t+1}^{(l)}=\max(\alpha_{t+1}^{(l)},\frac{s}{L+1}),\forall l=0,\ldots,L$;

Normalize $\alpha_{t+1}^{(l)}=\frac{\alpha_{t+1}^{(l)}}{Z_{t}}$ where $Z_{t}=\sum\nolimits_{l=0}^{L}\alpha_{t+1}^{(l)}$;

Update $\psi_{t+1}$ as per Eq. (10);

End for

Figure 2.

The framework of ensemble residual network (ERN). Dark blue lines represent feedforward flow for computing residual block features. Orange lines indicate the prediction flow by the combination of outputs for each residual block. Dark green lines indicate the online updating flows with the hedge backpropagation approach. Light blue and green lines represent the flows that train an autoencoder.

4.1 Ensemble residual network

The framework of ERN is depicted in Fig. 2. First, we construct several residual blocks [19] and attach a classifier to each block. The final prediction of ERN is a weighted combination of the predictions of all classifiers. In particular, let $\{\mathbf{x}_{t}\mid t=1,\ldots,T\}$ denote an input sequence, where $\mathbf{x}_{t}\in\mathbb{R}^{d}$ is a $d$-dimensional vector and $d$ is the number of pixels of the image in our setting. Let $\mathrm{block}^{(l)}$ be the $l^{\text{th}}$ residual block, and denote by $\mathbf{x}^{(l)}$, $W^{(l)}$ and $\mathrm{h}^{(l)}$ the input, parameters and output of $\mathrm{block}^{(l)}$, respectively. $L$ is the number of residual blocks, and $\mathcal{F}$ is the residual function, e.g., a stack of two $3\times 3$ convolutional layers [19]. $\mathcal{M}$ is a linear projection for matching the dimensions [19], and $\sigma$ is the ReLU activation function. $\Theta^{(l)}$ denotes the parameters of the $l^{\text{th}}$ classifier $g_{t}^{(l)}$, whose weight is denoted by $\alpha^{(l)}>0$. For the input data stream $\mathbf{x}_{t}$, the prediction function $\mathrm{F}(\mathbf{x}_{t})$ of the proposed ERN is given by Eq. (2), where pooling is the max pooling operator.

$\displaystyle\mathrm{F_{t}}(\mathbf{x}_{t})=\sum_{l=0}^{L}\alpha^{(l)}g_{t}^{(l)},\quad\text{where}$ (2)

$\displaystyle g_{t}^{(l)}=\textit{pooling}(\mathrm{h}_{t}^{(l)})\Theta^{(l)},\quad\forall l=0,\ldots,L$

$\displaystyle\mathrm{h}_{t}^{(l)}=\mathcal{M}(\mathbf{x}_{t}^{(l)})+\mathcal{F}(\mathbf{x}_{t}^{(l)},W^{(l)}),\quad\forall l=0,\ldots,L$

$\displaystyle\mathbf{x}_{t}^{(l)}=\sigma(\mathrm{h}_{t}^{(l-1)}),\quad\forall l=1,\ldots,L$

$\displaystyle\mathbf{x}_{t}^{(0)}=\mathbf{x}_{t}.$

Different from the Online Deep Learning (ODL) method, which uses fully connected layers as the basic blocks to extract data representations [9], we use residual blocks as the basic blocks in our framework, which exhibit superior capability for the online image learning problem. In addition, different from the original ResNet framework [19], wherein the final prediction is derived from $\mathrm{h}_{t}^{(L)}$, our framework makes predictions via a weighted combination of the outputs obtained from $\mathrm{h}_{t}^{(0)},\ldots,\mathrm{h}_{t}^{(L)}$.
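The weighted combination in the first line of Eq. (2) can be illustrated with a small pure-Python sketch. It assumes that each classifier output $g_{t}^{(l)}$ is already available as a vector of class scores; computing those outputs from residual blocks is outside the scope of this sketch, and the function name `ern_predict` is hypothetical.

```python
def ern_predict(alphas, classifier_outputs):
    """Combine per-block classifier outputs: F_t(x_t) = sum_l alpha^(l) * g_t^(l)."""
    if len(alphas) != len(classifier_outputs):
        raise ValueError("one weight is needed per classifier")
    n_classes = len(classifier_outputs[0])
    combined = [0.0] * n_classes
    for alpha, g in zip(alphas, classifier_outputs):
        for k in range(n_classes):
            combined[k] += alpha * g[k]
    return combined


# Three classifiers (L = 2) voting on a two-class problem.
scores = ern_predict([0.5, 0.3, 0.2], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

A shallow classifier with a large weight dominates the vote early on; as the weights shift, deeper classifiers can take over without any change to the combination rule.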

Second, we use Hedge Backpropagation (HBP) [9, 25], an efficient hedging strategy, to update the parameters $\alpha^{(l)}$, $\Theta^{(l)}$ and $W^{(l)}$ of our framework online. The loss suffered by the model can be represented as

$\displaystyle\mathcal{L}^{\textit{ERN}}=\sum\nolimits_{l=0}^{L}\alpha^{(l)}\mathcal{L}(g_{t}^{(l)},y_{t}).$ (3)

For the update of $\alpha$, all weights are initialized uniformly at the first iteration, i.e., $\alpha^{(l)}=\frac{1}{L+1},l=0,\ldots,L$. In every iteration, the classifier $g_{t}^{(l)}$ makes a prediction $\hat{y}_{t}^{(l)}$. When the ground truth is revealed, the weight of the classifier is updated based on the loss it suffers as

$\displaystyle\alpha_{t+1}^{(l)}\leftarrow\alpha_{t}^{(l)}\beta^{\mathcal{L}(g_{t}^{(l)},y_{t})}$ (4)

where $\beta\in(0,1)$ is the discount rate parameter, and $\mathcal{L}(g_{t}^{(l)},y_{t})\in(0,1)$ [25]. In every iteration, the weight of a classifier is discounted by a factor of $\beta^{\mathcal{L}(g_{t}^{(l)},y_{t})}$ based on its performance. At the end of every round, the weights $\alpha$ are normalized such that $\sum\nolimits_{l}\alpha_{t}^{(l)}=1$. In addition, to achieve a tradeoff between exploration and exploitation, we set a smoothing parameter $s\in(0,1)$ and smooth the weights as $\alpha_{t+1}^{(l)}\leftarrow\max(\alpha_{t+1}^{(l)},\frac{s}{L+1})$ after the weight update in each iteration.
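The three weight operations described above (discounting as in Eq. (4), smoothing, and normalization) can be sketched as follows. The order follows Algorithm 1; the function name `hedge_update` and the default values of `beta` and `s` are illustrative assumptions, not values reported by the authors.

```python
def hedge_update(alphas, losses, beta=0.99, s=0.2):
    """One round of Hedge weight updates for the L+1 classifiers.

    alphas: current weights alpha_t^(l), one per classifier.
    losses: per-classifier losses L(g_t^(l), y_t), assumed to lie in (0, 1).
    """
    n = len(alphas)  # n = L + 1
    # Discount each weight by beta raised to the loss of its classifier (Eq. 4).
    discounted = [a * beta ** loss for a, loss in zip(alphas, losses)]
    # Smoothing: keep every weight at or above s / (L + 1) so no classifier is starved.
    smoothed = [max(a, s / n) for a in discounted]
    # Normalize so that the weights sum to one.
    z = sum(smoothed)
    return [a / z for a in smoothed]
```

With uniform initial weights, a classifier that suffers a smaller loss ends the round with a larger weight, while the smoothing floor keeps poorly performing (typically deeper, not yet converged) classifiers in play.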

To learn the parameters $\Theta^{(l)}$ of all the classifiers, we employ the online gradient descent (OGD) [13] strategy, where the input to the $l^{\text{th}}$ classifier is $\mathrm{h}^{(l)}$. Similar to the update of the output-layer weights in standard feedforward networks, the update is given by

$\displaystyle\Theta_{t+1}^{(l)}\leftarrow\Theta_{t}^{(l)}-\eta\nabla_{\Theta_{t}^{(l)}}\mathcal{L}(\mathrm{F}_{t}(\mathbf{x}_{t}),y_{t})=\Theta_{t}^{(l)}-\eta\alpha^{(l)}\nabla_{\Theta_{t}^{(l)}}\mathcal{L}(g_{t}^{(l)},y_{t})$ (5)

Learning the feature representation parameters $W^{(l)}$ is different from the original backpropagation scheme, where the error derivatives are backpropagated from the output layer. In our strategy, the error derivatives are backpropagated from every classifier $g_{t}^{(l)}$ . Thus, using the adaptive loss function $\mathcal{L}^{\textit{ERN}}$ and applying OGD rule, the update for $W^{(l)}$ can be represented as

$\displaystyle W_{t+1}^{(l)}\leftarrow W_{t}^{(l)}-\eta\sum_{j=l}^{L}\alpha^{(j)}\nabla_{W^{(l)}}\mathcal{L}(g_{t}^{(j)},y_{t}),$ (6)

where $\eta$ is the learning rate. The summation in the gradient term starts at $j=l$ because only the classifiers at depth $l$ or deeper rely on $W^{(l)}$ for making predictions. In effect, the gradient of the final prediction is assembled from the derivatives backpropagated from the predictor at every block, each weighted by its $\alpha^{(j)}$, which indicates the performance of the corresponding classifier. This offers a good initialization for the deeper networks, which are encouraged to match the performance of the shallower networks, and it facilitates knowledge transfer from shallow to deeper networks.
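To make the index range of Eq. (6) concrete, the following toy sketch treats $W^{(l)}$ as a single scalar and assumes that the per-classifier gradients with respect to it have already been computed. The function name `block_update` and the argument layout are hypothetical; a real implementation would obtain these gradients by backpropagating from each classifier.

```python
def block_update(w, l, alphas, grads_wrt_block, eta=0.01):
    """One OGD step for a scalar stand-in of W^(l), following Eq. (6).

    grads_wrt_block[j] is the gradient of L(g_t^(j), y_t) with respect to W^(l).
    Only classifiers j >= l depend on block l, so shallower entries are ignored.
    """
    total = sum(alphas[j] * grads_wrt_block[j] for j in range(l, len(alphas)))
    return w - eta * total


# Block 1 of a three-classifier network: classifier 0 does not use W^(1),
# so its entry (9.0) has no influence on the update.
new_w = block_update(1.0, 1, [0.5, 0.3, 0.2], [9.0, 1.0, 2.0], eta=0.1)
```

In this example the step uses $0.3\times 1.0+0.2\times 2.0=0.7$ as the aggregated gradient.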

Note that our ERN framework can be viewed as a sequential decision procedure [8, 31], which is a form of online learning with expert advice. In this framework, the multiplicative weight-update rule of Littlestone and Warmuth [31] can be adapted to our model, yielding bounds that are slightly weaker in some cases. Specifically, the cumulative loss of the learner, or of any expert, is the expected number of mistakes on the entire sequence, and the regret is the difference between the cumulative loss of the learner and that of the best expert. In our case, the regret under the Hedge mechanism can be bounded by $O(\sqrt{T\ln N})$ [9]; put another way, the average per-trial regret decreases at the rate $O(\sqrt{(\ln N)/T})$, where $N$ is the number of experts, which in our case is the number of residual blocks (i.e., the network depth). Hence, as $T$ increases, this average difference decreases to 0, which supports confidence in the training of our framework.
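The decay of the average per-trial regret can be checked numerically. The sketch below only evaluates the rate $\sqrt{(\ln N)/T}$ and ignores the constant hidden in the $O(\cdot)$ notation; the function name and the chosen values of $N$ and $T$ are illustrative assumptions.

```python
import math


def average_regret_rate(T: int, N: int) -> float:
    """Rate sqrt(ln(N) / T) of the average per-trial regret (constants omitted)."""
    return math.sqrt(math.log(N) / T)


# The rate shrinks by a factor of 10 whenever T grows by a factor of 100.
for T in (10**2, 10**4, 10**6):
    print(T, average_regret_rate(T, N=4))
```

The dependence on $N$ is only logarithmic, so adding residual blocks costs little in the bound compared with the benefit of observing more rounds.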

4.2 Latent representation mapping

During the rounds $t>T_{1}$, the streams from the old feature space have vanished and only the streams from the new feature space arrive. A model trained on the old data fails to provide desirable predictions for instances from the new feature space. To fill this gap, [2] assumes that there exists a certain relationship between the old and new feature spaces, and then employs a linear mapping to learn this relationship. For the image classification tasks in our FELIS setting, however, a linear function is too weak to represent the complex relationship. If a complicated nonlinear model (e.g., multivariate regression [32], streaming multi-label learning [33], etc.) is applied, numerous parameters are inevitably introduced for learning the mapping. Because of the lack of instances in the evolution period, it is impractical to estimate these mapping parameters. Accordingly, we propose an LRM method to derive the relationship between the different feature spaces in a parameter-free way. To this end, we modify the first residual block of ERN into an autoencoder [34] that extracts the latent representations of the image streams online.

Suppose that $\mathbf{\hat{x}}_{t}$ is the reconstructed version of $\mathbf{x}_{t}$ at time $t$, $\|\cdot\|_{2}$ denotes the $l_{2}$ norm, $\mathcal{D}_{t}^{s_{1}}$ and $\mathcal{D}_{t}^{s_{2}}$ are the datasets from the old and current feature spaces at time $t$, and $\phi(\cdot)$ stands for the encoder, by which the inputs are encoded into their corresponding latent representations. In the pre-overlapping (post-overlapping) period, since all instances come from the same feature space, the loss of the autoencoder is the reconstruction loss, i.e., the discrepancy between an instance and its reconstruction, as shown in Eq. (7).

$\displaystyle\mathcal{L}^{\textit{Recon}}=\mathbb{E}_{\mathbf{x}_{t}\in\mathcal{D}_{t}^{s_{1}}(\mathcal{D}_{t}^{s_{2}})}\frac{1}{d}\|{\mathbf{x}_{t}-\mathbf{\hat{x}}_{t}}\|_{2}^{2}.$ (7)

Figure 3.

For the overlapping period, the discrepancy of latent representations of the instances from the old and current feature spaces are imposed to be minimized.

In the overlapping period, two different feature spaces exist simultaneously, so the loss of the autoencoder consists of the reconstruction loss and the discrepancy between the latent representations from the different feature spaces (as shown in Fig. 3). Here, the discrepancy between the latent representations can be measured by the Maximum Mean Discrepancy (MMD) [35, 36] metric, which gives the loss

$\displaystyle\mathcal{L}^{\textit{MMD}}=\left\|\mathbb{E}_{\mathbf{x}_{t}^{(i)}\in\mathcal{D}_{t}^{s_{1}}}\phi(\mathbf{x}_{t}^{(i)})-\mathbb{E}_{\mathbf{x}_{t}^{(j)}\in\mathcal{D}_{t}^{s_{2}}}\phi(\mathbf{x}_{t}^{(j)})\right\|_{2}.$ (8)
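A minimal sketch of Eq. (8) follows, assuming that the latent representations $\phi(\mathbf{x})$ have already been computed and are given as plain lists of floats, and that the expectations are replaced by sample means. The helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(d)]


def mmd_loss(latents_s1, latents_s2):
    """l2 distance between the mean latent vectors of the two feature spaces (Eq. 8)."""
    m1 = mean_vector(latents_s1)
    m2 = mean_vector(latents_s2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))
```

For instance, latent sets with identical means give a loss of zero even when the individual vectors differ, which is why this term aligns the two spaces only at the level of their mean embeddings.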

Overall, the loss function of the autoencoder for one cycle (comprising the pre-overlapping, overlapping and post-overlapping periods) at time $t$ can be represented as

$\displaystyle\mathcal{L}^{AE}=\left\{\begin{array}{ll}\mathcal{L}^{\textit{Recon}}&t=1,\ldots,T_{1}-B\\ \mathcal{L}^{\textit{Recon}}+\lambda(\mathcal{L}^{\textit{MMD}})^{2}&t=T_{1}-B+1,\ldots,T_{1}\\ \mathcal{L}^{\textit{Recon}}&t=T_{1}+1,\ldots,T_{1}+T_{2},\end{array}\right.$ (9)

where $\lambda$ determines how strongly the latent representations are pulled together. Let $\psi_{t}$ be the parameters of the autoencoder at time $t$. The update of $\psi_{t}$ can be represented as

$\displaystyle\psi_{t+1}=\psi_{t}-\eta\nabla_{\psi_{t}}\mathcal{L}^{AE}(\mathbf{\hat{x}_{t}},\mathbf{x}_{t}),$ (10)

where $\eta$ is the learning rate.
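The piecewise loss of Eq. (9) can be sketched as a small function of the round index. The sketch assumes that the reconstruction loss and the MMD value for the current round are computed elsewhere and passed in; the per-instance form of Eq. (7) shown here drops the expectation, and all function names are hypothetical.

```python
def reconstruction_loss(x, x_hat):
    """Per-instance reconstruction loss (1/d) * ||x - x_hat||_2^2, cf. Eq. (7)."""
    d = len(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / d


def autoencoder_loss(t, T1, T2, B, recon, mmd=0.0, lam=1.0):
    """Autoencoder loss at round t of one cycle, following Eq. (9)."""
    if not 1 <= t <= T1 + T2:
        raise ValueError("t must lie in 1..T1+T2")
    in_overlap = T1 - B < t <= T1
    # The squared MMD term is active only while both feature spaces are observed.
    return recon + lam * mmd ** 2 if in_overlap else recon
```

Outside the overlapping period the `mmd` argument is ignored, which mirrors the fact that only one feature space is observed there.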

Note that the mapping between the new and old feature spaces is established indirectly by the LRM strategy. As shown in Fig. 2, the autoencoder is embedded in the framework of ERN, where the encoder, rather than the decoder, is attached to the classifier in the feedforward flow. This implies that the latent representations, instead of the raw representations of the images, contribute to the learning process. Thus, when a new instance arrives in the post-overlapping stage, we only need to obtain its latent representation with the autoencoder that has been trained during the pre-overlapping and overlapping stages, and then feed this latent representation to the classifier to continue learning. Hence, in practice, it is not necessary to map the new features to the old ones in their raw representations. In addition, when the number of features differs between the old and new spaces, the images must be resized to a common size by bilinear interpolation or pooling to match the input format of the autoencoder.

5. Experiments

5.1 Scenarios and datasets

To validate the effectiveness of our method on the FELIS problem, we constructed six image streams as experimental datasets from CIFAR-10 [30] and CIFAR-100 [30], where CIFAR-10 has 6000 examples for each of 10 classes and CIFAR-100 has 600 examples for each of 100 non-overlapping classes. Each of the six datasets contains two different feature spaces. For the old feature space, the instances are the images of CIFAR-10/CIFAR-100. For the new feature space, the instances are images transformed from the old feature space. Overall, these datasets can be divided into a virtual scenario and a real scenario. For the virtual scenario, we generate new feature spaces by adding random Gaussian noise to CIFAR-10/CIFAR-100. For the real scenario, two ecosystem-monitoring applications are considered: channel transformation and contrast transformation. In the case of channel transformation, we suppose that the ecosystem must be monitored all day long, and hence an optical camera that captures color images during the day and an infrared camera that captures gray images at night are set up. We simulate this setting by converting the three-channel CIFAR-10/CIFAR-100 images into one-channel gray images to generate new feature spaces. In the case of contrast transformation, we suppose that when a camera is replaced with a new one, the contrast of the images changes. To simulate this setting, we change the contrast of the CIFAR-10/CIFAR-100 images to construct new feature spaces. A summary of the six datasets is given in Table 1, and detailed descriptions of the datasets are as follows.

Table 1
The sketch of the datasets

Dataset	Old feature space	New feature space	Parameters	Scenario
CIFAR-10-GN	CIFAR-10	By adding noise	Noise drawn from $N(0,1)$	Virtual
CIFAR-100-GN	CIFAR-100	By adding noise	Noise drawn from $N(0,1)$	Virtual
CIFAR-10-CHC	CIFAR-10	By transforming channel	ITU-R 601-2 luma transform	Real
CIFAR-100-CHC	CIFAR-100	By transforming channel	ITU-R 601-2 luma transform	Real
CIFAR-10-COC	CIFAR-10	By transforming contrast	Scaling factor $[0.5,1.5]$	Real
CIFAR-100-COC	CIFAR-100	By transforming contrast	Scaling factor $[0.5,1.5]$	Real

CIFAR-10-GN: the images from CIFAR-10 are used as the data stream for the old feature space in the pre-overlapping period. The data stream for the new feature space in the post-overlapping period is generated by adding noise to the images of the pre-overlapping period, where the noise follows the standard normal distribution.

CIFAR-100-GN: the images from CIFAR-100 are used as the data stream for the old feature space in the pre-overlapping period. The data stream for the new feature space in the post-overlapping period is generated by adding noise to the images of the pre-overlapping period, where the noise follows the standard normal distribution.

CIFAR-10-CHC: the images from CIFAR-10 are used as the data stream for the old feature space in the pre-overlapping period. The data stream for the new feature space in the post-overlapping period is generated by converting the images of the pre-overlapping period to one-channel images with the ITU-R 601-2 luma transform. This luma transform can be written as $L=R*29.9\%+G*58.7\%+B*11.4\%$ , where $L$ is the gray level of the one-channel image, and $R$ , $G$ and $B$ are the intensities of the red, green and blue channels of the three-channel image.

CIFAR-100-CHC: the images from CIFAR-100 are used as the data stream for the old feature space in the pre-overlapping period. The data stream for the new feature space in the post-overlapping period is generated by converting the images of the pre-overlapping period to one-channel images with the ITU-R 601-2 luma transform. This luma transform can be written as $L=R*29.9\%+G*58.7\%+B*11.4\%$ , where $L$ is the gray level of the one-channel image, and $R$ , $G$ and $B$ are the intensities of the red, green and blue channels of the three-channel image.

CIFAR-10-COC: the images from CIFAR-10 are used as the data stream for the old feature space in the pre-overlapping period. The data stream for the new feature space in the post-overlapping period is generated by scaling the contrast of the images of the pre-overlapping period by a factor between 0.5 and 1.5.

CIFAR-100-COC: the images from CIFAR-100 are used as the data stream for the old feature space in the pre-overlapping period. The data stream for the new feature space in the post-overlapping period is generated by scaling the contrast of the images of the pre-overlapping period by a factor between 0.5 and 1.5.
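The three transformations described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the pixel scale to which the $N(0,1)$ noise is added, the exact definition of contrast scaling (here, scaling deviations from the mean intensity) and the uniform sampling of the contrast factor from $[0.5,1.5]$ are not given in the text and are assumed; the function names are hypothetical, and no clipping to the valid pixel range is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, rng):
    """GN: add noise drawn from N(0, 1) to every pixel value."""
    return img + rng.standard_normal(img.shape)

def to_gray(img):
    """CHC: ITU-R 601-2 luma transform, L = 0.299 R + 0.587 G + 0.114 B."""
    weights = np.array([0.299, 0.587, 0.114])
    return img @ weights  # (H, W, 3) -> (H, W)

def scale_contrast(img, factor):
    """COC: scale deviations from the mean intensity (assumed definition)."""
    mean = img.mean()
    return mean + factor * (img - mean)

# A random stand-in for one 32 x 32 RGB image from the old feature space.
img = rng.uniform(0.0, 255.0, size=(32, 32, 3))

noisy = add_gaussian_noise(img, rng)
gray = to_gray(img)
factor = rng.uniform(0.5, 1.5)  # assumed: one factor sampled per image
contrasted = scale_contrast(img, factor)
```

Note that only the channel transform changes the number of features (from $32\times 32\times 3$ to $32\times 32$); the noise and contrast transforms keep the dimensionality and change the feature values.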

5.2 Comparators and settings

We compare the proposed method with seven online learning competitors.

NOGD (Naive Online Gradient Descent) [13]: uses ResNet-18 [19] as the classifier; once the feature space changes, the online gradient descent algorithm is restarted from scratch.

FESL-c (FESL-combination) [2]: described in the Introduction, this method uses a shallow classifier for online learning and a linear mapping to approximate the relationship between the old and new feature spaces. It makes predictions for the new instances by combining the outputs of the old and new models, weighted by the exponential of their cumulative losses.

FESL-s (FESL-selection) [2]: instead of combining the old and new models to predict new instances as FESL-c does, FESL-s uses the best single model to make predictions. Note that although this method selects one model for prediction, it relies on both models for updating parameters.

ODL [9]: described in the Introduction, where the prediction of the network is a weighted combination of several 20-layer fully connected neural networks. The structure of the network in ODL is similar to that of the proposed ERN, except for the ensembled feature extractors.

FESL-c-Variant: a variant of FESL-c in which the linear classifier of FESL-c is replaced with the proposed ERN.

ERN $+$ LM: a variant of the proposed method in which the LRM is replaced with a linear mapping function for identifying the relationship between the old and new feature spaces.

ERN: in contrast to the full proposal, this method assumes invariant feature spaces, so only ERN is applied.

NOGD, ODL and ERN target the online learning setting with non-evolving feature spaces, and thus have no mapping mechanism. To make these methods applicable to our setting, we let them predict the new instances using the models trained in the pre-overlapping period. Note that NOGD, ODL, FESL-c and FESL-s are baseline methods used for comparison with our approach, while FESL-c-Variant, ERN $+$ LM and ERN are variants of the baselines and of our proposal, used for ablation studies.
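To make the difference between FESL-c and FESL-s concrete, the following minimal sketch combines or selects the outputs of an old and a new model from their cumulative losses. It is a simplification under assumptions: the step size `eta`, the example numbers and the function names are hypothetical, and the selection rule is shown as a deterministic choice of the model with the smallest cumulative loss, which may differ from the selection rule used in [2].

```python
import numpy as np

def combine_weights(cum_losses, eta):
    """Weights proportional to exp(-eta * cumulative loss), summing to 1."""
    cum_losses = np.asarray(cum_losses, dtype=float)
    # Subtracting the minimum leaves the normalised weights unchanged
    # and avoids underflow for large cumulative losses.
    scores = np.exp(-eta * (cum_losses - cum_losses.min()))
    return scores / scores.sum()

def predict_combination(preds, cum_losses, eta):
    """FESL-c style: weighted average of the per-model predictions."""
    return combine_weights(cum_losses, eta) @ np.asarray(preds, dtype=float)

def predict_selection(preds, cum_losses):
    """FESL-s style (simplified): use the model with the lowest cumulative loss."""
    return np.asarray(preds, dtype=float)[int(np.argmin(cum_losses))]

# Class-probability outputs of the old model (on recovered features) and the new model.
preds = [[0.7, 0.3], [0.2, 0.8]]
cum_losses = [12.0, 9.0]  # hypothetical cumulative losses so far

combined = predict_combination(preds, cum_losses, eta=0.5)
selected = predict_selection(preds, cum_losses)
```

In this example the new model has the smaller cumulative loss, so it receives the larger weight in the combination and is the one picked by the selection rule.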

All the methods are evaluated on image classification tasks on rounds $T_{1}+1,\ldots,T_{1}+T_{2}$ . To validate our analysis, we report the trend of the average cumulative loss (ACL); a smaller ACL indicates a better model. Concretely, at each time $t^{\prime}$ , the ACL $\overline{\ell}_{t^{\prime}}$ of every method is the cumulative loss over rounds 1, …, $t^{\prime}$ divided by $t^{\prime}$ , namely $\overline{\ell}_{t^{\prime}}=(1/t^{\prime})\sum_{t=1}^{t^{\prime}}\ell_{t}$ . We also report the classification accuracy over all instances on rounds $T_{1}+1,\ldots,T_{1}+T_{2}$ . The results of all approaches are averaged over 10 independent runs.
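The ACL defined above is a running mean of the per-round losses, which can be computed for all rounds at once. The sketch below is a direct transcription of the formula; the function name and the toy loss sequence are hypothetical.

```python
import numpy as np

def average_cumulative_loss(losses):
    """Return the ACL at every round t' = 1, ..., T for per-round losses."""
    losses = np.asarray(losses, dtype=float)
    rounds = np.arange(1, losses.size + 1)
    return np.cumsum(losses) / rounds

# Toy 0/1 losses over four rounds.
acl = average_cumulative_loss([1.0, 0.0, 1.0, 0.0])
# acl is [1.0, 0.5, 0.666..., 0.5]
```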

The parameters to be set are the number of instances in the overlapping period, $B$ ; the numbers of instances in ${S_{1}}$ and ${S_{2}}$ , i.e., ${T_{1}}$ and ${T_{2}}$ ; and the learning rate $\eta$ . These parameters are the same for all methods. In our experiments, we set $L=$ 7, $B=$ 1000, and ${T_{1}}$ and ${T_{2}}$ each to half of the number of instances. We fix the learning rate $\eta$ at 0.01 in ${T_{1}}$ and 0.003 in ${T_{2}}$ . For HBP, we set $\beta=$ 0.99 and the smoothing parameter $s=$ 0.08. We set the regularization factor $\lambda=$ 0.9 in the MMD loss. The hyperparameter settings of the compared methods are summarized in Table 2.

Table 2
The hyperparameters of the compared methods, where $\eta$ is the learning rate, $T_{1}$ and $T_{2}$ are the numbers of rounds in $S_{1}$ and $S_{2}$ , $B$ is the number of overlapping rounds, and $\lambda$ is the regularization factor in the MMD loss

	$\eta(T_{1})$	$\eta(T_{2})$	$T_{1}$	$T_{2}$	$B$	$L$	$\lambda$
NOGD	0.01	0.003	50000	50000	1000	–	–
FESL-c	0.01	0.003	50000	50000	1000	–	–
FESL-s	0.01	0.003	50000	50000	1000	–	–
ODL	0.01	0.003	50000	50000	1000	7	–
FESL-c-Variant	0.01	0.003	50000	50000	1000	7	–
ERN $+$ LM	0.01	0.003	50000	50000	1000	7	–
ERN	0.01	0.003	50000	50000	1000	7	–
Our method	0.01	0.003	50000	50000	1000	7	0.9
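To illustrate the role of the HBP parameters $\beta$ and $s$ , the sketch below applies a Hedge-style update to the weights of $L=7$ classifiers, following the general form of Hedge Backpropagation in ODL [9]: each weight is discounted by $\beta$ raised to that classifier's loss, floored at $s/L$ , and the weights are renormalised. The loss values are invented for the example, and the exact update used in the paper's implementation is assumed rather than confirmed here.

```python
import numpy as np

def hedge_update(alpha, losses, beta=0.99, s=0.08):
    """One Hedge-style update of the classifier weights."""
    alpha = np.asarray(alpha, dtype=float)
    losses = np.asarray(losses, dtype=float)
    num_classifiers = alpha.size
    alpha = alpha * np.power(beta, losses)           # discount by incurred loss
    alpha = np.maximum(alpha, s / num_classifiers)   # smoothing floor s / L
    return alpha / alpha.sum()                       # renormalise to sum to 1

L = 7
alpha = np.full(L, 1.0 / L)  # start from uniform weights
# Hypothetical per-classifier losses early in the stream: shallow classifiers fit faster.
losses = np.array([0.9, 1.1, 1.4, 1.8, 2.0, 2.2, 2.3])
for _ in range(100):
    alpha = hedge_update(alpha, losses)
```

With these invented losses the shallow classifiers end up with the largest weights; if the deeper classifiers later incur smaller losses, the same update shifts the weight towards them, while the floor $s/L$ keeps every classifier from being discarded entirely.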

5.3 Experimental results

Our experiments are carried out for answering the following research questions.

Q1. Does our method outperform the baselines? This tests whether our proposal is effective for solving the FESL problem.

The first four rows and the last row of Table 3 show the classification accuracies of the baseline methods and of our proposal in the FESL setting. Our method (ERN $+$ LRM) achieves the best accuracy among the compared methods on all six datasets. The accuracies of FESL-c and FESL-s are conspicuously lower than those of our method, because FESL-c and FESL-s use a shallow model and a linear mapping designed for simple, low-dimensional data streams, and these structures fail to learn the complex patterns of image streams. NOGD achieves higher accuracies than FESL-c and FESL-s on all six datasets, and ODL does so on the three CIFAR-10 streams, owing to their deeper classification models. However, these methods are designed for the online learning setting with non-evolvable features; they cannot use a mapping function to reconstruct the old features from the new ones, which degrades their accuracy on the new instances.

Figure 4 shows the trends of the ACL of our method and its competitors. From the experimental results, we make the following observations. First, the ACLs of FESL-c and FESL-s decrease rapidly when a small amount of data has arrived in the pre-overlapping stage, but with the arrival of evolved-feature instances in the post-overlapping stage, their ACLs stop declining or even increase. This is because the target classifiers of FESL-c and FESL-s are simple linear models, which have insufficient capacity to fit the new image instances with changed features. Second, although the ACLs of NOGD and ODL keep falling in all stages, they remain higher than ours at any given time. This is because, in the pre-overlapping stage, the classifier of our method has greater capacity than those of NOGD and ODL, while in the post-overlapping stage, NOGD and ODL have no mapping mechanism to handle the new instances with evolved features.

Table 3
The prediction accuracies of the proposed method and its comparing methods for different datasets

	CIFAR-10-CHC	CIFAR-10-COC	CIFAR-10-GN	CIFAR-100-CHC	CIFAR-100-COC	CIFAR-100-GN
Baselines
NOGD	0.734 $\pm$ 0.015	0.718 $\pm$ 0.046	0.716 $\pm$ 0.027	0.327 $\pm$ 0.060	0.335 $\pm$ 0.024	0.390 $\pm$ 0.049
ODL	0.432 $\pm$ 0.057	0.448 $\pm$ 0.069	0.416 $\pm$ 0.043	0.178 $\pm$ 0.006	0.181 $\pm$ 0.007	0.201 $\pm$ 0.018
FESL-c	0.236 $\pm$ 0.012	0.223 $\pm$ 0.006	0.311 $\pm$ 0.006	0.139 $\pm$ 0.007	0.139 $\pm$ 0.005	0.230 $\pm$ 0.008
FESL-s	0.322 $\pm$ 0.013	0.302 $\pm$ 0.006	0.375 $\pm$ 0.014	0.183 $\pm$ 0.008	0.188 $\pm$ 0.013	0.265 $\pm$ 0.008
Variants
ERN	0.745 $\pm$ 0.017	0.725 $\pm$ 0.013	0.717 $\pm$ 0.028	0.353 $\pm$ 0.046	0.350 $\pm$ 0.038	0.388 $\pm$ 0.021
FESL-c-Variant	0.684 $\pm$ 0.021	0.675 $\pm$ 0.019	0.649 $\pm$ 0.031	0.256 $\pm$ 0.022	0.263 $\pm$ 0.024	0.331 $\pm$ 0.034
ERN $+$ LM	0.314 $\pm$ 0.103	0.289 $\pm$ 0.125	0.308 $\pm$ 0.097	0.127 $\pm$ 0.086	0.116 $\pm$ 0.092	0.131 $\pm$ 0.095
ERN $+$ LRM (ours)	0.774 $\pm$ 0.011	0.755 $\pm$ 0.019	0.723 $\pm$ 0.015	0.421 $\pm$ 0.030	0.407 $\pm$ 0.023	0.395 $\pm$ 0.026

Figure 4.

The ACLs of the proposed method and its competitors for various datasets over the time from 1 to $T_{1}+T_{2}$ .

Q2. Is ERN conducive to learning complex patterns of image streams in the online learning setting? This tests whether ERN can dynamically select an appropriate model capacity.

We carried out several ablation experiments to answer this question; the results are listed in Table 3. From the table, we make the following observations. (1) The classification accuracy of FESL-c-Variant is higher than that of FESL-c. The only difference between FESL-c-Variant and FESL-c is their target models. This illustrates that ERN is more suitable for FESL tasks than the linear-hypothesis-based models used by FESL-c. (2) The classification accuracy of ERN is higher than that of NOGD on five of the six datasets (the exception is CIFAR-100-GN, with 0.388 versus 0.390). This is because the ensemble learning framework of ERN facilitates the transfer of knowledge from shallow to deeper networks, which helps capture complex patterns in image streams. (3) The classification accuracy of ERN is higher than that of ODL. This demonstrates that the residual blocks that form the basic components of ERN are more capable than fully connected blocks when dealing with image streams.

Figure 5 visualizes the weight distributions of the classifiers attached to the residual blocks. The vertical axis represents the weights, and the horizontal axis represents the index of the classifiers. Note that a higher index implies a deeper classifier. From this figure, we observe that in the initial stage, when the first 1% of the instances have arrived, the shallow classifiers have high weights (Fig. 5a). This indicates that when the number of instances is small, the residual blocks in the shallow layers play the leading role. As more instances arrive (Fig. 5b), the weights of the shallower classifiers decrease while the weights of the deeper classifiers increase. When 60%–80% of the instances have arrived, the weights of the deep classifiers dominate. This phenomenon illustrates that the ensemble learning framework of ERN encourages knowledge to transfer from shallow to deeper networks, thereby automatically adapting the effective depth of the network to a capacity appropriate for the image stream.

Q3. Does LRM improve the prediction accuracy for the new instances? This tests whether LRM can effectively identify the relationship of the old and new feature spaces.

We take ERN $+$ LM, the variant of the proposed method, for the ablation study. In contrast to our method, ERN $+$ LM uses a linear function to identify the mapping between the feature spaces. The last two rows of Table 3 show the experimental results. The results show that a linear function is ineffective at uncovering the complex relationship between the old and new feature spaces, owing to the high dimensionality of the image streams. In contrast, our LRM method captures such complex connections well, which allows the learner to make predictions based on the latent representations.

Q4. Effects of MMD regularization in LRM and the block implementation in ERN.

To validate the effectiveness of the MMD term, we carried out an ablation experiment comparing the proposed method with the same method without the MMD term (termed Without MMD). The experimental results are listed in Table 4. We observe from the table that the prediction accuracies of the proposed method are higher than those obtained by the Without MMD method on all six datasets. This is because the MMD loss effectively aligns the latent representations of the old and current instances, which helps establish the relationship between the old and new feature spaces. Consequently, the prediction performance on the new instances is improved.
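As an illustration of how an MMD term measures the misalignment between two sets of latent representations, the sketch below computes a biased empirical estimate of the squared MMD with a Gaussian (RBF) kernel and scales it by $\lambda=0.9$ . The kernel choice, the bandwidth, the latent dimension and the way the term enters the overall objective are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd2(z_old, z_new, bandwidth=4.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    k_oo = rbf_kernel(z_old, z_old, bandwidth).mean()
    k_nn = rbf_kernel(z_new, z_new, bandwidth).mean()
    k_on = rbf_kernel(z_old, z_new, bandwidth).mean()
    return k_oo + k_nn - 2.0 * k_on

rng = np.random.default_rng(0)
z_old = rng.standard_normal((64, 16))                      # latents from the old space
z_aligned = z_old + 0.01 * rng.standard_normal((64, 16))   # nearly aligned latents
z_shifted = z_old + 2.0                                    # misaligned latents

lam = 0.9  # regularization factor from the experimental settings
penalty_aligned = lam * mmd2(z_old, z_aligned)
penalty_shifted = lam * mmd2(z_old, z_shifted)
```

The penalty is close to zero when the two sets of latent representations nearly coincide and grows when they are shifted apart, which is the behaviour a regularizer of this kind relies on to pull the two latent distributions together.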

To test the effect of the block implementation in our framework, we replace the original ResNet block with the ResNeXt block [37] in ERN and conduct experiments under the same settings. The results are shown in Table 4, where this variant of the proposed method is termed With ResNeXt. From the table, we observe that our framework with ResNeXt blocks further improves the online learning performance compared with ResNet blocks, owing to the more powerful feature representation ability of ResNeXt. This indicates that the blocks of our framework play a vital role in the learning performance. Designing more elaborate blocks may further improve the prediction accuracy of the network.

Table 4

The prediction accuracy rates of the variants of the proposed method

	CIFAR-10-CHC	CIFAR-10-COC	CIFAR-10-GN	CIFAR-100-CHC	CIFAR-100-COC	CIFAR-100-GN
The proposal	0.774 $\pm$ 0.011	0.755 $\pm$ 0.019	0.723 $\pm$ 0.015	0.421 $\pm$ 0.030	0.407 $\pm$ 0.023	0.395 $\pm$ 0.026
Without MMD	0.749 $\pm$ 0.023	0.731 $\pm$ 0.017	0.718 $\pm$ 0.019	0.357 $\pm$ 0.038	0.356 $\pm$ 0.024	0.388 $\pm$ 0.033
With ResNeXt	0.793 $\pm$ 0.024	0.778 $\pm$ 0.027	0.755 $\pm$ 0.018	0.447 $\pm$ 0.035	0.424 $\pm$ 0.038	0.418 $\pm$ 0.031

Figure 5.

Weight distributions of the classifiers attached to the residual blocks when training on CIFAR-100-COC.

6. Conclusion

This paper explored a new online learning paradigm, termed FELIS, to address the problem that existing FESL methods cannot handle high-dimensional data streams. First, we proposed an ERN framework that enables the online learner to capture the complex patterns of image streams. This framework integrates several residual blocks and makes predictions by a weighted combination of the outputs of the classifiers attached to these blocks, which encourages knowledge to transfer from shallow to deeper networks as the number of instances increases. Second, we substituted the first residual block of ERN with an autoencoder and presented a latent representation mapping (LRM) approach to address feature evolution. LRM is able to capture the complex relationship between the old and new feature spaces in a parameter-free way, which helps the learner make predictions for the new instances that are aligned with the previously learned model. In future work, we will study regression tasks such as image super-resolution and image inpainting in the FELIS setting.

Footnotes

Datasets can be found in http://archive.ics.uci.edu/ml/.

Acknowledgments

This work was supported by National Natural Science Foundation of China (No. 62072127, No. 62002076), Project 6142111180404 supported by CNKLSTISS, Science and Technology Program of Guangzhou, China (No. 202002030131, No. 201904010493), Guangdong basic and applied basic research fund joint fund Youth Fund (No. 2019A1515110213), Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC202101), Natural Science Foundation of Guangdong Province (No. 2023A1515011774, No. 2020A1515010423), Scientific research project for Guangzhou University (No. RP2022003).

Declarations

The authors declare that they have no conflicts of interest regarding this work. We declare that we have no commercial or associative interest that represents a conflict of interest in connection with the work submitted.

References

1. Zhang, Long, Ding, Zhang and Wu, Online learning from trapezoidal data streams, IEEE Transactions on Knowledge and Data Engineering 28(10) (2016), 2709–2723.

2. Hou B.-J., Zhang and Zhou Z.-H., Learning with feature evolvable streams, arXiv preprint arXiv:1706.05259, 2017.

3. Beyazit, Alagurajah and Wu, Online learning from data streams with varying feature spaces, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 3232–3239.

4. Beyazit, Chen and Wu, Online learning from capricious data streams: a generative approach, in: International Joint Conference on Artificial Intelligence Main Track, 2019.

5. Yuan, Chen and Wu, Online Learning in Variable Feature Spaces under Incomplete Supervision, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 4106–4114.

6. Zhang Z.-Y., Zhao, Jiang and Zhou Z.-H., Learning with Feature and Distribution Evolvable Streams, in: International Conference on Machine Learning, PMLR, 2020, pp. 11317–11327.

7. Hou B.-J., Yan Y.-H., Zhao and Zhou Z.-H., Storage fit learning with feature evolvable streams, arXiv preprint arXiv:2007.11280, 2020.

8. Cesa-Bianchi and Lugosi, Prediction, Learning, and Games, 2006.

9. Sahoo, Pham and Hoi S.C., Online deep learning: Learning deep neural networks on the fly, arXiv preprint arXiv:1711.03705, 2017.

10. Cesa-Bianchi and Lugosi, Prediction, learning, and games, Cambridge university press, 2006.

11. Hoi S.C., Wang and Zhao, Libol: A library for online learning algorithms, Journal of Machine Learning Research 15(1) (2014), 495.

12. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review 65(6) (1958), 386.

13. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in: Proceedings of the 20th International Conference on Machine Learning (icml-03), 2003, pp. 928–936.

14. Crammer, Dekel, Keshet, Shalev-Shwartz and Singer, Online passive aggressive algorithms, 2006.

15. Kivinen, Smola A.J. and Williamson R.C., Online learning with kernels, IEEE Transactions on Signal Processing 52(8) (2004), 2165–2176.

16. Hoi S.C., Jin, Zhao and Yang, Online multiple kernel classification, Machine Learning 90(2) (2013), 289–316.

17. Sahoo, Hoi S.C. and Li, Online multiple kernel regression, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 293–302.

18. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.

19. Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

20. Wang, Kuang, Tan Y.-a. and Li, The security of machine learning in an adversarial setting: A survey, Journal of Parallel and Distributed Computing 130 (2019), 12–23.

21. Teerapittayanon, McDanel and Kung H.-T., Branchynet: Fast inference via early exiting from deep neural networks, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 2464–2469.

22. Zhou, Sohn and Lee, Online incremental feature learning with denoising autoencoders, in: Artificial Intelligence and Statistics, PMLR, 2012, pp. 1453–1461.

23. Lee S.-W., Lee C.-Y., Kwak D.-H., Kim and Zhang B.-T., Dual-Memory Deep Learning Architectures for Lifelong Learning of Everyday Human Behaviors, in: IJCAI, 2016, pp. 1669–1675.

24. Yoon, Yang, Lee and Hwang S.J., Lifelong learning with dynamically expandable networks, arXiv preprint arXiv:1708.01547, 2017.

25. Freund and Schapire R.E., A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55(1) (1997), 119–139.

26. Hazan, Agarwal and Kale, Logarithmic regret algorithms for online convex optimization, Machine Learning 69(2–3) (2007), 169–192.

27. Shalev-Shwartz et al., Online learning and online convex optimization, Foundations and trends in Machine Learning 4(2) (2011), 107–194.

28. Bengio, LeCun et al., Scaling learning algorithms towards AI, Large-scale Kernel Machines 34(5) (2007), 1–41.

29. LeCun, Boser, Denker J.S., Henderson, Howard R.E., Hubbard and Jackel L.D., Backpropagation applied to handwritten zip code recognition, Neural Computation 1(4) (1989), 541–551.

30. Krizhevsky, Hinton et al., Learning multiple layers of features from tiny images, 2009.

31. Freund and Schapire R.E., A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, in: Conference on Learning Theory, 1997.

32. Kibria B.G., Bayesian statistics and marketing, Taylor & Francis, 2007.

33. Read, Bifet, Holmes and Pfahringer, Streaming multi-label classification, in: Proceedings of the Second Workshop on Applications of Pattern Analysis, JMLR Workshop and Conference Proceedings, 2011, pp. 19–25.

34. Hinton G.E. and Salakhutdinov R.R., Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507.

35. Borgwardt K.M., Gretton, Rasch M.J., Kriegel H.-P., Schölkopf and Smola A.J., Integrating structured biological data by kernel maximum mean discrepancy, Bioinformatics 22(14) (2006), e49–e57.

36. Tzeng, Hoffman, Zhang, Saenko and Darrell, Deep domain confusion: Maximizing for domain invariance, arXiv preprint arXiv:1412.3474, 2014.

37. Xie, Girshick, Dollár and He, Aggregated Residual Transformations for Deep Neural Networks, IEEE, 2016.
