Abstract
In Transfer Learning (TL) a model that is trained on one problem is used to simplify the learning process on a second problem. TL has achieved impressive results for Deep Learning, but has been scarcely studied in genetic programming (GP). Moreover, predicting when, or why, TL might succeed is an open question. This work presents an approach to determine when two problems might be compatible for TL. This question is studied for TL with GP for the first time, focusing on multiclass classification. Using a set of reference problems, each problem pair is categorized into one of two groups. TL compatible problems are problem pairs where TL was successful, while TL non-compatible problems are problem pairs where TL was unsuccessful, relative to baseline methods. DeepInsight is used to extract a 2D projection of the feature space of each problem, and a similarity measure is computed by registering the feature space representation of both problems. Results show that it is possible to distinguish between both groups with statistically significant results. The proposal does not require model training or inference, and can be applied to problems from different domains, with a different number of samples, features and classes.
Introduction
Machine learning (ML) has greatly impacted the way in which difficult problems are solved in many domains. Nonetheless, several issues with ML-based problem solving persist. Some of these issues include the brittleness of some ML models [32], the necessity for overwhelming amounts of data in some domains [23], and the apparent existence of a still elusive, overarching master algorithm [10]. Another important issue concerns the re-usability of ML models. In general, it is assumed that each time an ML algorithm is applied to a particular dataset, it generates a model that is specialized for that particular problem instance, and while it is expected to generalize to unseen data, it is assumed that the training and test sets are drawn from the same underlying distribution and domain.
This is accepted in ML, but it is counter-intuitive to how learning works for many life forms [42]. In some scenarios it is reasonable to assume that it should be possible to learn on one task, and apply what is learnt on another. Recently, this general idea has coalesced around what is called domain adaptation [34], or more generally transfer learning (TL) [39,40], which can be characterized as follows. A learning algorithm is applied to one problem, the source task, and some knowledge or elements of what is learnt are used as part of the solution or training process for a second problem, called the target task.
TL addresses several of the issues with ML mentioned above. First, it allows for the reuse of previously learnt models, extending their usefulness beyond the original problem they were intended to solve. Second, it reduces the amount of data and computation that is required to solve a new task, since much of the learning was already done on the source task. TL has been applied in several domains, including natural language processing [29] and computer vision [25], since it can potentially simplify and speed up the process of solving new problems, for example with deep neural networks (DNN) that often require large amounts of training data [14,24].
One of the pitfalls, however, when using TL to solve a new target problem is that there is no general way in which to determine when TL might be successful, or to provide an objective explanation of why this might be the case [5,7,27]. A general rule of thumb is that the target and source domains should be similar, but this is mostly expressed as a subjective feature of the TL setup, rather than an objectively verifiable property [40]. For instance, in the case of computer vision, this can sometimes be convincingly argued since, for example, it is natural for a human expert to see the similarities between detecting a particular illness in one type of image or another [24]. However, there is a limited amount of literature concerning the development of a principled approach towards predicting the success or failure of a TL system, particularly when TL is applied to problems from different domains [7].
This work presents a coarse-grained analysis of what is referred to as TL compatibility between a source and a target problem, analyzing the results presented in [27]. While TL compatibility is presented in detail in Section 3.1, it can be understood as a measure of success for TL when it is applied to a particular pair of problems. The main motivating research question of this work can be stated as follows. How is TL compatibility related to the feature space similarity between the source and the target tasks? Our goal is to determine when two problems could be appropriate candidates for a successful TL to take place, and to provide a plausible explanation of why it is so. We propose to analyze the feature space of the source and target tasks using a structured Euclidean representation that allows for a principled way in which problem similarity can be computed. In particular, we use the approach proposed by the DeepInsight method, since it allows us to extract a visual representation (an image) from non-image data by projecting a problem’s feature space onto a 2D plane [36]. It is hypothesized that when a suitable correspondence between the feature spaces of both tasks exists, then one can expect a high level of TL compatibility, and vice versa. The proposed approach could be used to determine when TL might be applicable to models evolved with genetic programming (GP) [20]. Therefore, we extend the work presented in [27], and present the first study that shows how the similarity between two problems can be measured to determine the level of success achieved when these problems are used to perform TL. The proposed approach does not require model training or inference, and can be applied to problems from different domains, with a different number of feature dimensions, a different number of classes and different amounts of samples from each problem.
Moreover, unlike previous works that have studied how and when TL is successful [5,7], this work studies TL by GP for the first time, and does not rely on problems from the same domain.
The remainder of the paper proceeds as follows. Section 2 discusses how TL has been applied in GP and previous approaches towards predicting TL success. Section 3 presents our methodology, including our experimental setup, data and evaluation criteria. Results are presented and discussed in Section 4. Finally, Section 5 presents our conclusions.
Background and related work
GP performs learning using an evolutionary process as defined by neo-Darwinian principles [33]. It proceeds using a basic evolutionary cycle consisting of the following elements. First, a domain or problem specific representation is defined for the task at hand. For ML tasks such as regression or classification a common representation is to use a tree structure. Trees are constructed using a set of primitive elements, such that internal nodes can contain basic operators (arithmetic, trigonometric or others) used to construct the desired model or function, while leaf nodes include input variables or 0-arity functions. Afterward, the evolutionary process begins by generating a random set (population) of solution candidates (individuals), each of which is evaluated to determine its ability (fitness) to solve the problem. In the case of supervised ML the trees are evaluated from the leaves (inputs) to the root node (output) on each sample in the training set, and the output is compared with the target or desired output using a performance measure such as classification accuracy or a loss function. Based on the assigned fitness, solutions are stochastically selected and used to generate new solution candidates using variation or search operators. The selected solutions are referred to as parents and the generated solutions are referred to as offspring. The two basic variation operators are mutation and crossover. The former starts with a single parent and randomly modifies it to generate an offspring, while the latter normally uses two parents and randomly combines them to generate an offspring. The offspring and the parents are then used to construct a new population using a survival strategy, and the process is iterated until a termination condition is reached, such as a maximum number of iterations (generations).
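The evolutionary cycle described above can be illustrated with a minimal sketch. It is not the GP system used in this work: the primitive set, the mean squared error fitness, the tournament size, the elitist survival strategy, the depth limit and all function names are assumptions chosen only to keep the example small.

```python
import operator
import random

# Assumed primitive set: binary arithmetic operators as internal nodes.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
MAX_DEPTH = 6  # assumed limit that keeps trees (and their outputs) small


def random_tree(rng, n_inputs, depth):
    """Grow a random tree; leaves are input variables or constants."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_inputs))
        return ("const", rng.uniform(-1.0, 1.0))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, n_inputs, depth - 1),
            random_tree(rng, n_inputs, depth - 1))


def evaluate(tree, x):
    """Evaluate a tree from the leaves (inputs) to the root (output)."""
    if tree[0] == "x":
        return x[tree[1]]
    if tree[0] == "const":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))


def error(tree, samples, targets):
    """Fitness: mean squared error on the training set (lower is better)."""
    return sum((evaluate(tree, x) - y) ** 2
               for x, y in zip(samples, targets)) / len(samples)


def depth(tree):
    return 1 + max(depth(tree[1]), depth(tree[2])) if tree[0] in OPS else 0


def paths(tree, path=()):
    """Yield the path (sequence of child indices) to every node."""
    yield path
    if tree[0] in OPS:
        for i in (1, 2):
            yield from paths(tree[i], path + (i,))


def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def graft(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    node = list(tree)
    node[path[0]] = graft(tree[path[0]], path[1:], new)
    return tuple(node)


def mutate(rng, parent, n_inputs):
    spot = rng.choice(list(paths(parent)))
    return graft(parent, spot, random_tree(rng, n_inputs, 2))


def crossover(rng, first, second):
    spot = rng.choice(list(paths(first)))
    donor = subtree(second, rng.choice(list(paths(second))))
    return graft(first, spot, donor)


def evolve(samples, targets, pop_size=30, generations=15, seed=0):
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    score = lambda t: error(t, samples, targets)
    population = [random_tree(rng, n_inputs, 3) for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        best = min(population, key=score)
        best_per_generation.append(score(best))
        tournament = lambda: min(rng.sample(population, 3), key=score)
        offspring = [best]  # survival strategy: keep the best parent
        while len(offspring) < pop_size:
            if rng.random() < 0.5:
                child = mutate(rng, tournament(), n_inputs)
            else:
                child = crossover(rng, tournament(), tournament())
            if depth(child) <= MAX_DEPTH:
                offspring.append(child)
        population = offspring
    best = min(population, key=score)
    return best, best_per_generation + [score(best)]
```

Because the best parent always survives, the best error recorded per generation cannot increase; a classification variant would only need a different fitness function.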
GP has been shown to be a powerful learning mechanism, capable of generating accurate and interpretable models for both classification [15] and regression [21,28]. However, TL has received only marginal research interest with GP, particularly when compared to Deep Learning (DL). We previously presented a comprehensive survey on TL in GP [27], from which we can highlight the following. First, TL can be traced back to various forms of population seeding techniques [22] or problem decomposition approaches [37]. Second, interest in applying TL in GP has gradually increased in recent years in various domains [9,13,19,31]. More recently, TL in GP has been used in urban traffic modeling [11] and a variant of TL has been applied to program synthesis [17].
Moreover, in our previous work we presented a unique analysis of TL with GP [27], using both multiclass classification and symbolic regression problems. Results show that TL can in fact be successful even when the domains of the target and source problems are different. Moreover, TL does not seem to be symmetric; i.e., a successful transfer from source to target does not guarantee that if the roles were inverted then performance would also be good. Related to the topic of the present paper, among works that deal with TL with GP, our former work [27] is the only one to consider the question of predicting the success or failure of a transferred solution. Regarding classification, for instance, [27] showed that it was possible to predict with some accuracy the transfer compatibility between a target and a source, using a set of descriptive statistical properties of both the source and the target tasks. However, the approach (based on a Random Forest model) was not interpretable or explainable and relied on the training and testing performance on the source and target tasks.
An example of an approach that attempts to predict TL success can be found in [7]. The method, called Predict To Learn (P2L), is developed for DL models, and validated on image classification and natural language processing problems. P2L, evaluated on 21 different tasks, showed that it was able to select the best source task for a given target task with 62% accuracy. At the core of P2L, there is a measure, called embedding divergence, which is used to estimate the relative improvement that a transferred model can achieve in the target task compared to a randomly chosen model. The measure considers two aspects: the size (number of training instances) of the source task dataset and a measure of dissimilarity between the source and the target tasks. The former is based on experimental evidence that shows a correlation between the size of the source and performance on the target [18]. However, this is probably more relevant for DL models that require large amounts of data to fit all of the network parameters. The latter element, task dissimilarity, is what is often assumed to be a core ingredient in TL success, but difficult to pin down. In P2L, this is done by extracting the output vectors produced at specific layers of the neural networks, computing a summary statistic over all instances in the datasets, and using an appropriate distance measure, such as the Kullback-Leibler or Jensen-Shannon divergence. While their results are state-of-the-art, they are not highly accurate, cannot be easily applied to other types of ML models (such as GP), and do not provide an interpretable description of the success or failure of a particular TL between a source and target task. Moreover, the method requires a previously trained model and inference on the datasets.
Another approach to compare problems is Task2Vec [1], which relies on what is referred to as a probe neural network, which is used to extract an embedding of a problem based on the network parameters. This provides a useful way to compare problems of different dimensions. However, the method requires training on the target task and is focused on neural network models, with the selection of the probe neural network being an ad-hoc process. In [2], a measure of complexity to compare tasks is proposed, which also depends on a training process to be executed and is also focused on neural network models. Other related works have focused on choosing the best data from a source domain where many samples are available, to enhance training on a target domain where the number of samples is insufficient to properly train a model [16]. On the other hand, [43] analyzes when TL is successful or not by studying specific aspects of computer vision tasks with neural networks, to build a structured representation of such tasks and a taxonomy of TL in this domain.
Probably the best example of a method that can be used to compare two problems, compute a measure of similarity, and use this measure to predict the success or failure of TL is the Optimal Transport Dataset Distance (OTDD) [5]. OTDD is a metric that can be used to compare two problem datasets, which can in principle be applied to datasets with different number of samples, classes and feature dimensions, and does not require model training or inference. It has shown a strong correlation with TL performance on computer vision and text analysis tasks, correctly identifying the best source problem for a given task from the same domain. Moreover, it can be applied in a supervised (considering target labels) or unsupervised (without target labels) manner. An efficient and open source implementation is also available.
It is important to note that in all previous studies related to predicting TL performance, the focus has been on using neural network models and most use problems from similar domains (vision tasks). On the other hand, our study focuses on GP and studies TL between problems from different domains. Similarly to OTDD, it does not require model training or inference, and it can be applied to problems with a different number of samples, classes or features (by applying feature reduction).

Summary of our proposed approach and experimental methodology. Our analysis is based on the case study from [27] shown on the left, which includes the problem datasets, dimensionality reduction and TL compatibility scores. On the right, our proposed approach applies DeepInsight to extract a feature space representation, which is then used to compare problems based on a registration process and evaluated using the TL compatibility.
This section presents the main experimental goals, the experimental data used and the proposed methodology, which is summarized in Fig. 1. Overall, the study has two main elements. The first one is based on our reference case study originally reported in [27]. We employ the same set of reference problems, preprocessing the datasets to obtain a homogeneous dimensionality for all problems while maintaining a high information content. Moreover, we use the exhaustive TL evaluation from [27] as our ground-truth reference to establish when two problems can be considered to be compatible for TL.
The second element of our study represents our main contribution, the proposed approach toward determining problem similarity for TL and our experimental validation. First, for each problem we extract a 2D feature space representation using the DeepInsight method [36], which allows us to compare problems using a point cloud registration, providing an objective measure of problem similarity. These measures are then used to determine when two problems are compatible for TL, and the process is evaluated based on the data and results reported in [27]. The reference case study is presented next, along with the background related to the DeepInsight method [36].
Case study
The main elements of the experimental work presented in [27] are summarized next. The GP system used was M3GP [28], which is a constructive induction method that uses a wrapper approach to evolve feature transformations while the final predictive model is generated by a simple ML algorithm. For regression M3GP was combined with multiple linear regression, while for classification the Mahalanobis distance (MD) was employed. Solutions in M3GP are represented by multi-tree individuals, where each subtree of the root node represents a new feature for the problem, such that the new feature space is expected to be better suited for learning. In particular, given a real-valued feature space
At the beginning of the run, all individuals encode a single transformation (
While [27] studied TL in both regression and classification tasks, this work focuses on classification. TL was evaluated considering all possible combinations of source and targets, using eight multiclass problems. These are: HRT, IM-3, WAV, SEG, IM-10, YST, VOW and ML; each is summarized in Table 1.
The first thing to note is that the problems are from very diverse domains, including satellite data in IM-3 and IM-10 [38], a medical dataset (HRT), image segmentation data (SEG), a yeast cell localization problem (YST), synthetic (WAV) [6] and real-world (VOW) waveform analysis, and sign language recognition (ML), with HRT, SEG, VOW, YST and ML from the Keel repository [4]. These classification problems also represent various degrees of difficulty, sample size, feature space size and types of features reasonably well. The problems range from being well-represented (IM-10 has 6,798 instances) to sparsely represented (HRT has 270 instances). Moreover, these datasets range from low dimensionality (IM-3 and IM-10 have six features) to high dimensionality (ML has 90 features). Finally, our datasets include binary and real-valued features. One final aspect is that the problems have different types and numbers of labels, making them difficult to compare. Thus, we carefully chose these benchmark problems so that the performed evaluations are not problem-dependent.
To perform TL with M3GP the evolved feature transformation from one task was used to build a new feature space in the target task before applying the ML algorithm using only the transformed target task data. This leads, however, to the feature alignment problem, given that the source and the target tasks may have a different number of raw features, and because there is no a priori correspondence between both feature spaces to determine how to apply the evolved feature transformations from the source on the target’s features. Therefore, the performance of TL was evaluated as follows. First, the feature space of all problems was reduced to six features using principal component analysis (PCA) or mutual information (MI). Then, all possible combinations of both feature reduction strategies and feature correspondences were tested. There are four feature reduction permutations for the source and target (PCA/MI, MI/PCA, PCA/PCA, MI/MI) and, with six features for both the source and the target, 6! = 720 possible alignments between the feature spaces of both tasks.
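The size of this search space can be checked with a short sketch. The names below are illustrative assumptions, and an alignment is represented as a permutation of the target's feature indices.

```python
from itertools import permutations, product

REDUCTIONS = ("PCA", "MI")  # feature reduction strategies
N_FEATURES = 6              # features kept after reduction

# Every (source reduction, target reduction) combination.
reduction_pairs = list(product(REDUCTIONS, repeat=2))

# Every way of mapping the target's features onto the model inputs.
alignments = list(permutations(range(N_FEATURES)))


def apply_alignment(sample, alignment):
    """Reorder one target sample so that input i receives feature alignment[i]."""
    return [sample[j] for j in alignment]
```

Each of the four reduction pairs is therefore tested with all 720 alignments, which gives 2,880 transfer evaluations per ordered source and target pair.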
The success or failure of TL was quantified based on the idea of a positive transfer. This occurs when the performance of a transferred model, applied on the target task, is better than the performance of a baseline method, based on testing data. Two baselines were considered for classification tasks, the Mahalanobis distance (MD) classifier and M3GP. The first one provides a simple but powerful baseline [8], while the second has been shown to be a powerful state-of-the-art GP-based classification algorithm [35,41]. Moreover, TL compatibility is proposed in [27] as a measure that summarizes the number of positive transfers that occur when different feature reduction strategies are used for the source and target tasks. The TL compatibility is between zero and four, equal to the number of feature reduction permutations that produce at least one positive transfer (among all possible feature alignments between both tasks). Table 2 summarizes the TL compatibility scores for all source and target combinations, considering both baseline methods (MD and M3GP). This gives us four TL compatibility values for any two problems, for the two baseline methods and the two possibilities of each problem being the source or the target. Hence, TL compatibility scores for every problem pair are presented in format
TL compatibility scores for every problem pair, with the TL compatible problems (TLCP) group in a grey background and the TL non-compatible problems (TLNP) group in a white background
We want insights that allow us to determine when a given problem pair represents a good candidate for TL, without requiring any information beyond a problem’s feature space (after feature reduction). In [7], the authors attempt to determine which source is best suited for a particular target, but to do this, the model is required and it is necessary to apply the model on the target data. Similarly, in our previous work, knowledge of the performance of the baseline methods is required [27]. Since we will not use the feature transformation, or any learning process, to analyze the observed TL compatibility, we can be more confident that we are basing our analysis on intrinsic properties of a problem’s feature space.
What is required is a distance or similarity measure between any two problems [7] that can explain the underlying assumption in any TL setup; i.e., that the source and target tasks are similar in some way. One aspect to consider, however, is that [27] showed that TL is not symmetrical, which imposes a limitation regarding how two problems should be analyzed, contrary to what is assumed in other approaches [5]. From the results presented in Table 2, we make the following observations. In all problem pairs at least one positive transfer is obtained, since at least one score is non-zero in all cases. Some problem pairs clearly exhibit the non-symmetrical result discussed above, for example HRT and VOW (0, 0, 4, 3) or IM-3 and ML (1, 1, 4, 2). In other cases, TL works very well relative to the simpler baseline (MD) but not to the more powerful classifier (M3GP), for instance SEG and IM-10 (4, 0, 4, 0). Moreover, we can define at least two groups of problem pairs based on their four TL compatibility scores.
First group: TL compatible problems (TLCP). This group is composed of all problem pairs with at least three non-zero TL compatibility values, out of the four reported in Table 2, highlighted in grey. This means that independent of which problem is the source, and which is the target, TL can produce at least one positive transfer between these two tasks.
Second group: TL non-compatible problems (TLNP). These are the remaining problem pairs, where TL was least successful or where performance was non-symmetric.
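The scoring and grouping rules can be written down as a small sketch. The function names and the input layout are assumptions, and the performance measure is assumed to be one where higher is better (such as test accuracy).

```python
def tl_compatibility(test_scores, baseline_score):
    """Count the feature reduction pairs with at least one positive transfer.

    test_scores maps a (source_reduction, target_reduction) pair to the test
    scores of the transferred model over all tested feature alignments.
    Assumption: higher scores are better, so a positive transfer is any
    score strictly above the baseline.
    """
    return sum(
        1
        for scores in test_scores.values()
        if any(score > baseline_score for score in scores)
    )


def tl_group(compatibility_scores):
    """TLCP if at least three of the four scores are non-zero, else TLNP."""
    non_zero = sum(1 for score in compatibility_scores if score > 0)
    return "TLCP" if non_zero >= 3 else "TLNP"
```

Under this rule HRT and VOW (0, 0, 4, 3) and SEG and IM-10 (4, 0, 4, 0) both fall in the TLNP group, since each has only two non-zero scores.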
It is clear that both groups are somewhat strict in the manner in which they are defined. TLCP requires that TL be successful independent of which is the target and which is the source task. This would represent an ideal condition that is often implicitly assumed to be true for TL applications, such that the similarity between both problems leads to successful transfers in both directions. On the other hand, TLNP problem pairs might exhibit good TL performance from one problem to the other, but not necessarily when the roles of source and target are inverted. This situation is important in practice, since choosing the correct source for a given target is critical.
To compute a similarity measure, we could take different approaches. One way might be to use descriptive measures of a problem, what could be referred to as meta-features, which focus on how complex or difficult a given problem is [26]. However, such measures are difficult to apply in multiclass problems, particularly when different problems have a different number of classes. Another approach is the one followed by OTDD, for instance, which focuses on comparing the empirical distributions of each dataset and measures what is required to transform the distributions of one dataset to match the distributions of another [5]. In our work, however, we hypothesize that the relative relationship between the features is a key element to differentiate between compatible and non-compatible problems for TL. In other words, we propose to analyze how different features relate to one another to extract a geometric representation of a problem’s feature space, and to our knowledge the only approach that can be used to extract such a representation is the one proposed with the DeepInsight method [36]. As described in the following section, DeepInsight projects the feature space of a problem onto a 2D Euclidean plane, allowing us to compare how different problems are projected. Our hypothesis is that when the projected feature spaces of two problems overlap, they are more likely to be compatible for successful TL to occur between them.
DeepInsight
In recent years, explainable AI (XAI) has pushed ML research to develop methods that can produce solutions which can be understood by humans [3]. While this is a potential feature of a method like GP [30], it can be elusive for complex models like those produced by DL. Moreover, the goal is to not only explain the models themselves, but to better understand the learning problem itself. DeepInsight, for instance, transforms non-image data into images, with two benefits. First, it allows for powerful convolutional neural networks (CNN) that have had great success in vision tasks to be applied to a broader set of problem domains. Second, and more relevant for our purposes, it provides a natural way to interpret a problem’s data and extract useful insights regarding the learning task. DeepInsight was applied to problems from diverse fields such as text analysis and RNA sequencing, extracting important information in an accessible format for the human observer [36].
The key element in the DeepInsight pipeline is to use kernel Principal Component Analysis (kPCA) (or potentially other types of feature space transformations) in an unconventional manner. The training set in many supervised learning problems can be defined as follows. The input samples are given by

Feature space alignment with DeepInsight: (a)–(b) show the feature space projection of a source (a) and a target (b) task, and how the features are used as input for an evolved model; (c) shows that after the feature spaces of both problems are aligned, the model from the source problem can be used more effectively with the features from the target.
Instead of projecting data samples into a transformed feature space, as is normally done, DeepInsight projects the features of a problem onto a 2D plane or digital image. In this plane, the only visible pixels are the features of a problem, and their geometric distribution within the image is determined by their relative similarities, or correlation, based on the corresponding feature values of each sample in the dataset. In this way, each sample in a dataset can be represented by an image where the intensity value of the relevant pixels (problem features) is based on its feature space representation. The DeepInsight feature space images do take into account all of the samples in the dataset to extract their visual representations. In our study, however, it is the geometric distribution of a problem’s feature space that is of interest, instead of the images of specific samples (as is done in [36]), since we want to compare complete problems instead of single instances. DeepInsight allows us to compare how different problem features relate to one another, based on how they are projected onto the 2D plane. The core of our proposed approach is to determine how similar these projections are based on their potential overlap.
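The idea of projecting features rather than samples can be illustrated with a minimal sketch. It swaps in plain linear PCA (computed by power iteration on the Gram matrix of the transposed data) for the kernel PCA used by DeepInsight, and it omits the later steps that turn the coordinates into a pixel grid; the function name and parameters are assumptions.

```python
import math


def feature_map_2d(samples, iters=200):
    """Place each feature on a 2D plane using linear PCA on the transposed data.

    samples: list of n rows, each holding d feature values.
    Returns a list of d (x, y) coordinates, one per feature.
    """
    n, d = len(samples), len(samples[0])
    # Transpose: each feature becomes a point described by its n sample values.
    feats = [[row[j] for row in samples] for j in range(d)]
    # Centre every sample dimension across the d features.
    means = [sum(f[i] for f in feats) / d for i in range(n)]
    feats = [[f[i] - means[i] for i in range(n)] for f in feats]
    # d x d Gram matrix of the centred feature vectors.
    gram = [[sum(a * b for a, b in zip(fa, fb)) for fb in feats] for fa in feats]

    coords = [[0.0, 0.0] for _ in range(d)]
    for axis in range(2):
        vec = [1.0 + 0.37 * j * j for j in range(d)]  # fixed starting vector
        for _ in range(iters):  # power iteration for the leading eigenvector
            vec = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = math.sqrt(sum(v * v for v in vec))
            if norm == 0.0:
                break
            vec = [v / norm for v in vec]
        lam = sum(v * sum(g * w for g, w in zip(row, vec))
                  for v, row in zip(vec, gram))
        if lam <= 1e-12:
            break  # no variance left for this axis
        for j in range(d):
            coords[j][axis] = math.sqrt(lam) * vec[j]
        # Deflate so the next pass finds the second component.
        gram = [[gram[a][b] - lam * vec[a] * vec[b] for b in range(d)]
                for a in range(d)]
    return [tuple(c) for c in coords]
```

In this sketch, features whose values coincide over all samples land on the same point, while features that behave differently across the samples are placed apart, which is the geometric property the proposed comparison relies on.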
Figure 2 depicts the rationale of our proposed approach. Figure 2(a) shows the DeepInsight projection of the feature space from a source problem with three features, and how each feature corresponds to a particular input of an evolved model. On the other hand, Fig. 2(b) shows the feature space for a target task, and, as explained before, in most cases the features of the source and the target will not correspond to each other (the feature alignment problem). A naïve application of TL would simply feed the first feature of the target to the first input of the model, but this will mostly lead to negative transfers. However, positive transfers do sometimes occur in our case study [27], considering all possible permutations of the feature space of the target problem. We hypothesize that positive transfers can be predicted by first aligning both feature spaces, as shown in Fig. 2(c). After aligning the feature spaces, if a close correspondence exists between the feature spaces of the source and the target, then the possibility of a positive transfer should substantially increase. Our proposal quantifies how well the feature spaces of two problems can be aligned, and uses this as a measure of problem similarity to predict TL compatibility. We hypothesize that problem pairs in the TLCP group can be closely aligned, while the converse should be true for the problem pairs in the TLNP group.
Experiments
Results and analysis
The following experiments with our proposed methodology and DeepInsight, as provided by the authors [36], were implemented using Matlab 2018. For each problem defined in Table 1 we extract its feature space representation using DeepInsight. This is done for both versions of the problem, using either PCA or MI as the feature reduction step. For instance, Fig. 3 shows the feature space representations extracted by DeepInsight for the IM3 and WAV problems using both PCA and MI for feature reduction. This problem pair is in the TLCP group, with positive transfers from both problems and based on both baselines, with TL compatibility scores of (2, 2, 4, 4); see Table 2.
Each plot (image) in Fig. 3 is a 2D representation of the feature space of each problem, depending on the feature reduction strategy that was used. In these figures each square represents one of the six features. Note that the axes of these plots are the principal components of the PCA applied to each dataset. When PCA is applied in a standard way to X, each point in the image would represent a data sample and the axes of the plots would represent the directions of maximum variation in feature space. Conversely, these plots represent the directions of maximum variation in sample space after applying PCA on G, a unique representation of the problem.

DeepInsight feature space representation for the IM3 (blue) and WAV (red) problems, shown in the first two rows. ICP registration for two example cases (out of the eight possibilities of different feature reduction combinations) in the last row. The ICP distance in (e) is 5.66 and 7.78 in (f). Notice that in (e) three features from the different problems overlap after registration, explaining the low distance between that problem pair. On the other hand, none of the features in (f) overlap, thus the high ICP distance in that case. The final MRD for the problem pair is 5.66 since the minimum is considered.

DeepInsight feature space representation for the VOW and IM10 problems, showing the ICP 2D point cloud registration for two example cases (out of 8 possibilities). The ICP distance for (a) is 9.9294 and for (b) it is 9.6608, while the MRD for the problem pair is 9.66.
The goal is to compare the feature space representation of two problems to determine their similarity, based on how well they can be aligned. This is done by performing a registration of the DeepInsight images of two problems taken as 2D point clouds using the iterative closest point (ICP) method. ICP searches for the best rigid transformation (rotation and translation) of one point cloud to another (the reference image) and returns the average Euclidean distance between the resulting point cloud and the reference. The computed ICP measure is not symmetric, since it depends on the starting condition of the registration process; i.e., it depends on which image is used as the reference. As an example, the last row of Fig. 3 shows the 2D registration results for the IM3 and WAV problems, and their associated distance returned by ICP in two cases. Note that this problem pair is in the TLCP group, with good TL results irrespective of which problem is used as source or target. Notice that in the last row of Fig. 3, after ICP registration of the 2D feature space representations, the features (shown as red and blue squares) of each problem tend to either overlap, in some cases, or lie very close to one another. As a counterexample, Fig. 4 shows a problem pair (IM10 and VOW) from the TLNP group with TL compatibility scores of (2, 0, 2, 0). Notice that in this case the feature space representations of both problems do not match, even after registration, with several features from each problem lying far away from the features of the other problem.
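The registration step described above can be sketched as a basic point-to-point ICP in 2D. This is an illustrative stand-in, not the Matlab routine used in the experiments: the function name, the fixed iteration count and the nearest-neighbour residual used as the returned distance are all assumptions.

```python
import math


def icp_distance(moving, reference, iters=30):
    """Rigidly register `moving` onto `reference` (2D point lists) and return
    the mean distance from each registered point to its closest reference point."""
    pts = [(float(x), float(y)) for x, y in moving]
    ref = [(float(x), float(y)) for x, y in reference]
    n = len(pts)
    for _ in range(iters):
        # 1. Match every moving point to its closest reference point.
        matches = [min(ref, key=lambda q: math.dist(p, q)) for p in pts]
        # 2. Centroids of the moving points and of their matches.
        pcx = sum(p[0] for p in pts) / n
        pcy = sum(p[1] for p in pts) / n
        qcx = sum(q[0] for q in matches) / n
        qcy = sum(q[1] for q in matches) / n
        # 3. Closed-form least-squares rotation angle for the matched pairs.
        cross = sum((p[0] - pcx) * (q[1] - qcy) - (p[1] - pcy) * (q[0] - qcx)
                    for p, q in zip(pts, matches))
        dot = sum((p[0] - pcx) * (q[0] - qcx) + (p[1] - pcy) * (q[1] - qcy)
                  for p, q in zip(pts, matches))
        theta = math.atan2(cross, dot)
        c, s = math.cos(theta), math.sin(theta)
        # 4. Rotate about the moving centroid, then translate onto the
        #    centroid of the matches.
        pts = [(c * (p[0] - pcx) - s * (p[1] - pcy) + qcx,
                s * (p[0] - pcx) + c * (p[1] - pcy) + qcy) for p in pts]
    return sum(min(math.dist(p, q) for q in ref) for p in pts) / n
```

Since matches are always drawn from the reference cloud, swapping the two arguments generally changes the result, which mirrors the asymmetry noted above. A rigid transformation cannot rescale a cloud, so two layouts with different spacing keep a non-zero distance even after registration.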
Therefore, for any problem pair we compute eight values, considering the four feature reduction combinations (PCA-MI, MI-PCA, PCA-PCA and MI-MI) and the two possibilities of computing the ICP measure (alternating which problem is used as the reference). Finally, for our analysis we focus on the minimum of the eight values as a representative measure of problem similarity; we call this the Minimum Registration Distance (MRD) of the DeepInsight feature space images. MRD focuses on the minimum distance because it can be seen as the best-case scenario for problem similarity, a baseline to determine when two problems might be a good match for a TL setup. The MRD is given in Table 3 for each problem pair. With this, we define the following two null hypotheses:
Results are summarized in Table 4; in particular, the second row shows the results obtained with MRD. The mean and median MRD of the TLCP group are 6.79 and 6.90, while for the TLNP group they are 7.86 and 7.75. To evaluate the hypotheses we use the Wilcoxon rank sum test for
MRD for each problem pair; the TLCP group is shaded in grey and the TLNP group is not
Statistical comparison of the proposed similarity measure (MRD) and a comparison with OTDD. Results show the mean and median for the TLCP and TLNP groups, and the p-values obtained when comparing both groups using the Wilcoxon and t-test statistical tests. OTDD is computed using different feature reduction strategies (PCA and MI), with
For comparison purposes, Table 4 presents similar results for the TLCP and TLNP groups using the OTDD measure. While in principle it is possible to compute an OTDD value for a pair of problems with a different feature space size, the open-source implementation of OTDD can only be computed for problems with the same number of features. Therefore we compute the score by applying the same feature reduction strategies, PCA and MI, and limiting the maximum number of feature dimensions to six. It is also possible to compute OTDD considering the class labels (supervised) or not (unsupervised). MRD does not account for class labels or the number of classes for the compared problems. Therefore, OTDD is computed using both configurations.
Results show that, unlike MRD, most of the OTDD variants are not able to detect a statistically significant difference between the TLCP and TLNP groups. While [5] showed a strong correlation between OTDD and the performance of TL, that study only considered problems from similar domains (vision or text analysis). Moreover, when OTDD does not account for class labels, the difference between the TLCP and TLNP groups is significantly reduced.
Discussion
The results in Table 3 can be examined more closely. First, Fig. 5 uses the force-directed layout of [12] to plot the graph defined by the distance matrix in Table 3. In this plot, edge thickness is proportional to the MRD score and the edge color represents the respective TLC group, blue for TLCP and black for TLNP.

Graph visualization of the distance matrix presented in Table 3. Light blue edges represent the TLCP group while black edges the TLNP group.
Table 5 presents summary statistics of the MRD measure for each problem; i.e., the median and interquartile range (IQR) of the MRD for each problem relative to the others. The table also summarizes the number of times each problem appears in the TLCP and TLNP groups, labeled TLCP Count and TLNP Count, respectively. We can see that the MRD measure correlates with the number of times a problem was in the TLCP and TLNP groups. For instance, VOW has the second largest median MRD and also the second highest TLNP count, as could be expected, while SEG has the highest median MRD and is tied for the second highest TLNP count. On the other hand, ML has the lowest median MRD and only appears in the TLCP group, while the problem with the second lowest median MRD appears only once in the TLNP group. For the other problems there is some variability, but the general tendency supports the hypothesis testing results.
Summary statistics of the MRD measure for each problem compared with the rest, namely the median and IQR. The last two rows also show the TLCP count and TLNP count for each problem
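The per-problem statistics summarized in Table 5 could be derived from a pairwise distance matrix along the following lines. This is only a sketch under stated assumptions: the function name `per_problem_summary`, the boolean `compatible` matrix encoding TLCP membership, and the toy values in the usage example are hypothetical placeholders and are not taken from Table 3.

```python
import numpy as np


def per_problem_summary(names, dist, compatible):
    """Median/IQR of each problem's distances to the others, plus group counts.

    dist:       symmetric (n, n) matrix of pairwise distances (e.g. MRD).
    compatible: symmetric (n, n) boolean matrix, True where the pair is TLCP.
    """
    dist = np.asarray(dist, dtype=float)
    compatible = np.asarray(compatible, dtype=bool)
    n = len(names)
    summary = {}
    for i, name in enumerate(names):
        others = np.arange(n) != i  # exclude the problem itself
        q1, med, q3 = np.percentile(dist[i, others], [25, 50, 75])
        summary[name] = {
            "median": float(med),
            "iqr": float(q3 - q1),
            "tlcp_count": int(compatible[i, others].sum()),
            "tlnp_count": int((~compatible[i, others]).sum()),
        }
    return summary


if __name__ == "__main__":
    # Placeholder data for four hypothetical problems A-D.
    names = ["A", "B", "C", "D"]
    dist = [[0, 1, 2, 3],
            [1, 0, 4, 5],
            [2, 4, 0, 6],
            [3, 5, 6, 0]]
    compatible = [[False, True, False, True],
                  [True, False, True, False],
                  [False, True, False, False],
                  [True, False, False, False]]
    for name, stats in per_problem_summary(names, dist, compatible).items():
        print(name, stats)
```

Note that `np.percentile` uses linear interpolation by default, so IQR values on small samples can differ slightly from those produced by other quartile conventions.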
There are, however, some potential limitations to the proposed approach that should be addressed in future studies. While the overall methodology based on a 2D projection of the feature space of each problem is straightforward, the current implementation relies on several heuristic design choices: the feature reduction methods considered (PCA and MI), the number of features used (currently six), and the manner in which MRD is aggregated over multiple comparisons (currently using the minimum value). Given the above, the MRD measure is not a formal metric, unlike OTDD, which limits its generalization and requires empirical evidence to validate its ability to predict the outcome of TL between two problems. However, it must be pointed out that even OTDD, which is based on a formal derivation and is defined as a metric in the space of possible datasets, currently suffers from similar issues at the implementation level.
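To make the aggregation choice explicit, the following sketch enumerates the eight registration distances for a problem pair (four feature reduction combinations times two choices of reference) and reduces them with a pluggable aggregator, with the minimum as the default. The names `registration_distances` and `aggregate`, the dictionary layout, and the `distance_fn` argument are assumptions for illustration; any registration distance that takes a moving and a reference point cloud could be supplied.

```python
from itertools import product

REDUCTIONS = ("PCA", "MI")


def registration_distances(distance_fn, images_a, images_b):
    """Return the eight distances for one problem pair.

    images_a, images_b: dicts mapping a reduction name ("PCA" or "MI") to the
    2D feature space representation of each problem under that reduction.
    distance_fn(moving, reference): a possibly asymmetric registration distance.
    """
    values = {}
    for red_a, red_b in product(REDUCTIONS, repeat=2):
        a, b = images_a[red_a], images_b[red_b]
        values[(red_a, red_b, "b_as_reference")] = distance_fn(a, b)
        values[(red_a, red_b, "a_as_reference")] = distance_fn(b, a)
    return values


def aggregate(values, how=min):
    """Collapse the eight distances into one score; `min` corresponds to the MRD."""
    return how(values.values())
```

Passing a different aggregator, for example `statistics.median`, gives a simple way to study how sensitive the group separation is to this particular design choice.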
This work presents an analysis of TL for a GP system, proposing a method to determine when two problems might be good candidates for TL to be applied. Each problem is characterized by a feature space representation extracted using DeepInsight, and it is hypothesized that when two problems share a similar feature space geometry then TL can be successfully applied between them. Unlike previous works, this paper focuses on predicting the performance of TL for GP models, while previous studies have focused on neural network approaches. The approach does not require previously trained models or model training to compute the similarity measure between problems, making it distinct with respect to previous works, a feature that is relevant when using computationally costly methods such as GP. The proposal is evaluated on problems from very diverse domains using a varied set of feature spaces and number of classes, while most previous works have focused on computer vision tasks when analyzing the success of TL.
Results show that by aligning the feature space representations of both problems it is possible to compute a distance measure between them and to use this distance to characterize the problem pair. In particular, we classify problem pairs into two groups: TLCP contains pairs that are highly compatible based on TL performance between them, while TLNP contains problem pairs where TL was unsuccessful or only partially successful (i.e., TL was successful when transferring from one problem to the other, but not when the roles of source and target were inverted).
While initial results are encouraging, there are several future lines of research that should be explored. First, the presented study should be replicated on a larger set of classification and regression tasks, to determine how well the results generalize to supervised learning overall. Second, the proposed methodology should be evaluated using different feature reduction methods and extended to allow the comparison of problems with different feature space sizes, reducing the number of heuristic design choices. Finally, while ICP is a well-known method for point set registration, other approaches are available, such as robust point matching, spline robust matching or coherent point drift, among others, which might be used to simplify the computation of the proposed MRD.
Acknowledgements
This work was supported by CONACYT (Mexico) project CF-2023-I-724 and TecNM (Mexico) project 16788.23-P; the second author was supported by CONACYT post-graduate scholarship number 832861.
Compliance with ethical standards
The authors have no competing interests to declare that are relevant to the content of this article.
