Abstract
The varying organizational structures of business networks are a growing area of study for economists and sociologists. Network science makes it possible to identify the determinants of the performance of these business networks. In this work we look for the determinants of inter-firm performance. On the one hand, we propose SUWAN, a new method of supervised clustering with attributed networks, which aims at obtaining class-uniform clusters of the turnover while minimizing the number of clusters. This method performs representative-based supervised clustering, where a set of initial representatives is randomly chosen. One of the innovative aspects of SUWAN is that it applies a supervised clustering algorithm to attributed networks, which is accomplished by weighting the matrix of distances between nodes against the matrix of distances between their attributes when defining the clusters. As a benchmark, we use Subgroup Discovery on attributed network data. Subgroup Discovery focuses on detecting subgroups described by specific patterns that are interesting with respect to some target concept and a set of explaining features. On the other hand, in order to analyze the impact of the network’s topology on the group’s performance, some network topology measures and the group total turnover are analyzed. The proposed methodologies are applied to an inter-organizational network, the EuroGroups Register, a central register that contains statistical information on business networks from European countries.
Introduction
The organizational structure of firms is in constant change. Organizations tend to adapt and change in order to gain a competitive advantage. Nevertheless, firms can also accomplish their goals through collaboration with other organizations. Indeed, it is assumed that organizations are influenced by their inter-organizational relationships [1]. Whether they are defined as strategic alliances, trade networks, or joint ventures, or considered to be a result of the nature of the industry or local circumstances, these relationships are seen as a mode of economic cooperation [2]. An inter-organizational network represents the relationships between different organizations, where organizations are represented as vertices and their relationships as edges. Inter-firm or inter-organizational networks are important structures with a propensity for displaying community structure, meaning that groups of densely linked vertices that are poorly connected to other groups of vertices may be revealed [3]. According to Harenberg et al. [4], community detection consists of detecting groups of densely connected nodes that typically have fewer connections to nodes outside that group. Therefore, the aim of community detection algorithms is to find cohesive subgraphs of nodes that can be representative of a community, focusing on the structural aspects of the network. A remarkable contribution to the field was made by Newman, who developed an algorithm that optimizes a quality function known as modularity over the possible divisions of a network [5]. Clustering, on the other hand, studies a set of elements by splitting it into smaller groups with similar characteristics, focusing on compositional characteristics, i.e., characteristics related to the attributes of the nodes. Clustering algorithms aim at finding patterns in data points based on their similarity.
A cluster is a group of data points that are similar to each other based on their relation to surrounding data points. For this approach, two main methodologies are highlighted, hierarchical and partitional clustering algorithms. Partitional clustering methods discover all clusters concurrently and do not enforce a hierarchical structure, whereas hierarchical clustering algorithms find nested partitions iteratively [6].
Hoberecht et al. [7] state that organizational networks aim to improve performance. According to the authors, there are many reasons for establishing a network, for instance when organizations want to achieve a specific goal that is shared by another organization. Moreover, the corporate community invests in inter-organizational networks to maximize supply chain efficiency and profitability. Matous and Todo [8] studied the impact of network topology and diffusion on Japanese automobile production networks, revealing the coevolution of the reorganization of inter-organizational networks and organizational performance. Indeed, one important goal of the study of inter-firm networks (IFN) is to analyze the impact of network topology measures on performance measures, such as a company’s turnover.
This work is developed in two different and complementary ways. Firstly, we propose SUWAN, the SUpervised clustering With Attributed Networks, a new method of supervised clustering that can be used to detect important patterns in the turnover of inter-organizational networks, consisting of clustering groups of nodes, considering both the structural characteristics (topology of the network, based on relationships between nodes), and compositional characteristics (related to nodes’ attributes). SUWAN is based on the Single Representative Insertion/Deletion Hill Climbing with Restart (SRIDHCR) algorithm from Zeidat and Eick [9]. As a benchmark, subgroup discovery is used to detect and identify relevant network patterns [10]. The goal is to discover interesting associations among different variables with respect to a property of interest. In contrast with standard community detection methods, SUWAN assembles elements of a graph, based on their structural and compositional characteristics, while it provides class-uniform clusters, based on a predefined target variable. The reason why we have chosen Subgroup Discovery as a benchmark is that it provides a description and identification of communities based on the combination of their features.
From a supervised clustering perspective, some authors argue that classical clustering techniques do not guarantee that objects of the same class are grouped together, e.g., [11]. The proposed method follows the methodology of supervised clustering algorithms, which deal with this limitation by improving a measure of cluster purity. Furthermore, SUWAN tackles the unexplored field of applying supervised clustering to attributed networks. Another contribution of this work is the comparison of supervised clustering against a subgroup discovery benchmark on inter-organizational networks. Indeed, classical community detection techniques focus only on finding subgroups of nodes with a dense structure, lacking an interpretable description. Subgroup Discovery can deal with description-oriented community detection. Moreover, this approach can also provide insights beyond connectivity within communities, as well as into the relationships between subgroups of nodes.
Complementarily, in order to analyze the impact of the network’s topology on the group’s performance, some network topology measures and the group total turnover were analyzed. Essential measures of the networks’ compactness, centrality, and density were examined, and a multiple regression model was applied to the networks under analysis, with the total turnover as the dependent variable and the network topology measures as independent variables. Data from the EuroGroups Register of Eurostat [12] is used to illustrate both approaches.
The paper is structured as follows: related work is presented in Chapter 2. SUWAN, the supervised clustering algorithm with attributed networks, is introduced in Chapter 3. In Chapter 4, the described methodology is applied to the EuroGroups Register. Finally, in Chapter 5, conclusions and new challenges are discussed.
Related work
Clustering is a methodology that groups data according to a desired criterion, making it possible to find structure in a dataset [13]. The main goal of cluster analysis is to group a set of items into homogeneous groups, also referred to as clusters, using a pre-determined measure of similarity. When clustering is performed, the similarity within groups should be higher than the similarity between different groups [14].
Beyond the structural form, networks may contain additional information, which can be related to the entities and their relationships [15]. In attributed networks, nodes and/or edges are labeled with additional information, providing further dimensions for detecting patterns that describe a specific subset of nodes of the network. According to Vieira, Campos and Brito [16], community detection and classic clustering techniques consider only the structural information of networks. Moreover, the authors refer to the use of additional information about the nodes’ attributes to enhance the output of the algorithms, by combining information about the network architecture with the properties describing the vertices.
Unlike traditional clustering methodology that works around non-labeled data, the assumption in supervised clustering is that the items are classified. The objective of supervised clustering is to find class-uniform groups of items, that have a high probability density with respect to a single class [17]. The problem of supervised clustering may be presented as a pair (
Zeidat and Eick [9] presented k-medoid-style clustering algorithms for supervised clustering, with the k-medoid model aiming to search for
Where
and
This way, the algorithm builds an initial solution, using as representative objects, the members of the most frequent class in the data set. Then, the algorithm repeatedly and greedily inserts non-representative objects to the current set of representatives that yields the lowest value for the fitness function
Similar work was developed by Gan et al. [20], who employed a novel graph-based classification method called Supervised clustering-based Regularized Least Squares Classification (SuperRLSC). The proposed methodology is based on the idea that supervised clustering may uncover more genuine data structures than traditional clustering. A supervised k-means algorithm is used to partition the dataset into different meaningful clusters. Similar to Zeidat and Eick [9], a fitness function is designed that finds clusters that are as homogeneous as possible while minimizing the number of clusters.
Finley and Joachims [21] present a method of supervised clustering with the k-means algorithm. This work is based on the idea that, in order to successfully apply k-means, a similarity measure that reflects the properties of the clusters must be chosen carefully. Hence, a Structural Support Vector Machine (SSVM, a generalization of an SVM) is used to perform k-means as a supervised task. This methodology uses the SSVM to learn a parameterized distance measure such that k-means provides the preferred clusters and maximizes cluster accuracy. The similarity measure is learned from training examples of item sets with their correct clustering, so that future sets of items are grouped similarly.
In the same line of thought, Al-Harbi and Rayward-Smith [11] propose a new method of adapting k-means for supervised clustering. The authors argue that the traditional k-means algorithm for unsupervised clustering does not guarantee that objects of the same class are grouped together. Therefore, they propose an adaptation of k-means as a classifier clustering algorithm. The proposed method attempts to place the objects that have the same label in the same cluster by modifying strategic steps of the traditional k-means algorithm, namely the Euclidean metric and the objective function. The Euclidean metric used in k-means is transformed into a weighted Euclidean metric measuring the distance between
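The trade-off that such fitness functions encode (reward class-uniform clusters, penalize a large number of clusters) can be illustrated with a minimal sketch. It assumes the form commonly stated for SRIDHCR-style fitness functions, impurity plus a weighted penalty, with a penalty of sqrt((k - c) / n) when the number of clusters k is at least the number of classes c. The exact equations are not reproduced in this text, so the penalty form, the function names and the value of `beta` are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def impurity(labels, assignment):
    """Fraction of items that are not in the majority class of their cluster."""
    clusters = {}
    for label, cluster_id in zip(labels, assignment):
        clusters.setdefault(cluster_id, []).append(label)
    minority = sum(
        len(members) - Counter(members).most_common(1)[0][1]
        for members in clusters.values()
    )
    return minority / len(labels)


def penalty(k, n_classes, n_items):
    """Assumed penalty: grows once k exceeds the number of classes."""
    return sqrt((k - n_classes) / n_items) if k >= n_classes else 0.0


def fitness(labels, assignment, beta=0.1):
    """Impurity plus beta times the penalty; lower values are better."""
    k = len(set(assignment))
    c = len(set(labels))
    return impurity(labels, assignment) + beta * penalty(k, c, len(labels))


# Toy data: six items with three class labels (hypothetical values).
labels = [1, 1, 2, 2, 2, 3]
pure = [0, 0, 1, 1, 1, 2]        # one cluster per class
mixed = [0, 0, 0, 1, 1, 1]       # two clusters, each with one minority item
singletons = [0, 1, 2, 3, 4, 5]  # pure, but far too many clusters

for name, assignment in [("pure", pure), ("mixed", mixed), ("singletons", singletons)]:
    print(name, round(fitness(labels, assignment), 4))
```

In this toy example the class-uniform solution with three clusters scores best; the singleton solution is equally pure but is penalized for its number of clusters.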
The use of supervised clustering on attributed networks is rather new. The possibility of combining both the structural (topological) and compositional (attribute based) characteristics is one of the advantages of our proposal. Indeed, this is one of the limitations of the existing methods. For example, Ji et al. [22] already address a solution for this problem within the field of heterogeneous information networks. The authors use a co-clustering approach of heterogeneous nodes based on constrained orthogonal non-negative matrix tri-factorization. However, these methods are applied to networks containing different types of nodes, while we consider a single type of nodes.
Self-Organizing Maps (SOM) [23] also combine the topology and similarity of nodes. However, SOM uses a different concept of graph structure: a data item is mapped onto the node whose model is most similar to the data item, and the whole set can be regarded as constituting a similarity graph. In SOM, the graph is computed using only the similarity (compositional/attribute) characteristics of the items, and not the structural ones. In some related work, and also in the approach we follow, graphs are also based on the structural relationships between the items. As an example, a company can own another company (parental relationship) and several companies can be owned by the same company.
A methodology for description-oriented community detection has been proposed by Atzmueller, Doerfel and Mitzlaff [24]. This view is based on the fact that classical community detection techniques focus only on finding subgroups of nodes with a dense structure, lacking an interpretable description. Subgroup Discovery, a data mining technique that focuses on discovering interesting relationships between different objects [25], can provide insights beyond connectivity within communities, as well as into the relationships between subgroups of nodes. Subgroup Discovery focuses on detecting subgroups described by specific patterns that are interesting with respect to some target concept and a set of explaining features. Therefore, interesting patterns among subgroups can be revealed, for example, by inductive and exploratory data analysis tasks that find relations between a dependent variable and (several) independent variables [26], considering the compositional aspect of the networks. With the additional information supplied by attributed networks, Subgroup Discovery can be applied to combine both structural and compositional characteristics of the network.
In fact, Atzmueller and Lemmerich [27] first proposed a subgroup discovery algorithm, COMODO, which is an adaptation of the SD-Map algorithm. SD-Map is an exhaustive subgroup discovery algorithm that adapts the Frequent Pattern Growth method (FP-growth) to the subgroup discovery task. The FP-growth algorithm uses a compressed tree representation of the itemset database. The tree grows by reading each itemset and mapping it to a path. A quality function is used to assess a subgroup description and rate the identified subgroups during search, given a certain target variable [27]. Examples for quality functions are given by Eq. (4), where
We will use Subgroup Discovery methodology on attributed social networks as a benchmark for our method. One of the advantages of Subgroup Discovery is that it can deal with description-oriented community detection. However, Subgroup Discovery allows an overlapping of nodes between subgroups, which is not suitable when we want a partition where nodes should not overlap.
SUWAN is a SUpervised clustering algorithm With Attributed Networks. This new methodology employs the SRIDHCR algorithm [9], which consists of representative-based supervised clustering with the addition of a quality function that assesses cluster purity and quantity. SUWAN clusters groups of nodes (of one single type, as in homogeneous networks), considering both the structural and compositional characteristics of the network. Additionally, it is suitable for either categorical or numerical attributes. For this purpose, a fitness function is defined, as described in Eq. (5). The input parameters concern the target variable, the penalty,
The pseudocode for the SUWAN algorithm is presented in Fig. 1. The first step of the algorithm refers to the application to attributed networks, which considers both structural and compositional characteristics of the network. Thereby, two matrices are computed, one that measures the distances between node attributes,
As in the SRIDHCR algorithm, the proposed method starts by randomly selecting a set of initial representatives, denoted curr (Step 2.1). The number of elements contained in this solution defines the number of clusters
The algorithm then starts to generate new possible candidates,
Pseudocode for SUWAN algorithm. Source: the author.
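The idea of merging the two distance matrices and assigning each node to its closest representative can be sketched as follows. This is a minimal illustration and not the authors' implementation: the convex combination with a weight `alpha`, the scaling of both matrices to [0, 1], and all function names are assumptions, since the exact weighting formula is not reproduced in this text.

```python
def normalize(matrix):
    """Scale a distance matrix to [0, 1] by dividing by its largest entry."""
    largest = max(max(row) for row in matrix)
    return [[d / largest if largest else 0.0 for d in row] for row in matrix]


def combine(d_attr, d_net, alpha):
    """Assumed form: alpha * attribute distance + (1 - alpha) * network distance."""
    a, g = normalize(d_attr), normalize(d_net)
    n = len(a)
    return [[alpha * a[i][j] + (1 - alpha) * g[i][j] for j in range(n)]
            for i in range(n)]


def assign(dist, representatives):
    """For each node index, return the index of its closest representative."""
    return [min(representatives, key=lambda r: dist[i][r])
            for i in range(len(dist))]


# Hypothetical 4-node example. d_net holds hop counts on the path 0-1-2-3;
# d_attr holds made-up attribute dissimilarities.
d_net = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
d_attr = [[0, 3, 1, 4], [3, 0, 2, 1], [1, 2, 0, 3], [4, 1, 3, 0]]
representatives = [0, 3]

print(assign(combine(d_attr, d_net, 0.0), representatives))  # network distances only
print(assign(combine(d_attr, d_net, 1.0), representatives))  # attribute distances only
```

With the weight at one extreme, the clusters follow the network structure; at the other extreme, they follow attribute similarity. Intermediate values trade one off against the other.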
Measuring the quality of the clustering is an important step of the method’s implementation since it enables the comparison with other procedures. Evaluating the quality of a clustering is usually challenging because the correct clusters are not known. The implemented methodology, however, works with labeled data, so the clusters can be evaluated through their purity. Therefore, a new measure of overall quality based on cluster purity was computed to assess the quality of the clustering.
Let the classes in the data set
We apply SUWAN to recent information from the EuroGroups Register (EGR). EGR is a system of registers that includes a central register maintained by Eurostat as well as registers in each EU Member State and EFTA country [28]. Information regarding international company groups is kept in the central register, which keeps track of multinational corporations with statistically significant financial and non-financial transnational operations in at least one European country. The EGR database is composed of several distinct groups of firms that form multinational corporations. Those multinational groups can be seen as inter-organizational networks, making the EGR database suitable for network analysis.
In order to fully understand how the EGR network is formed, some basic concepts, such as Multinational Enterprise Group, Legal Unit, Enterprise, Global Group Head and Global Decision Center, must be introduced. A summary of these concepts is presented below in Table 1.
Summary of basic concepts of the EGR database
Legal Units (LEUs) are not always represented in the same way by the sources used for statistical business registrations. These units might differ across nations and between various sources within a country. As a result, the LEU is ineffective as a statistical unit, especially in international comparisons. The Legal Unit is always the foundation for the statistical unit known as the “enterprise”, either alone or in conjunction with other Legal Units.
The enterprise is the statistical unit at which information about its transactions is kept, including financial and balance-sheet accounts, and from which international transactions, an international investment position (if applicable), consolidated financial position, and net worth can be calculated.
The EGR database is accessed through relational tables, which hold information on the networks of Multinational Enterprise Groups. Each table contains a primary key that identifies the groups, legal units, or enterprises, through the attributes GEG_EGR_ID, LEU_EGR_ID and ENT_EGR_ID, respectively. These variables also act as secondary keys, making it possible to identify the connections between tables. In this way, networks can be identified in the EGR by establishing affiliation links between organizations, which can be perceived as a “parent” and “child” relationship.
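As a rough illustration of how affiliation links turn into networks, the sketch below builds an undirected adjacency structure from a list of parent/child pairs and extracts the group that contains a given unit. The identifiers and the `links` list are hypothetical placeholders, not actual EGR records or column layouts.

```python
from collections import defaultdict, deque

# Hypothetical affiliation links: (parent legal unit ID, child legal unit ID).
links = [("H", "A"), ("H", "B"), ("A", "C"), ("X", "Y")]


def build_adjacency(links):
    """Undirected adjacency sets built from parent/child pairs."""
    adj = defaultdict(set)
    for parent, child in links:
        adj[parent].add(child)
        adj[child].add(parent)
    return adj


def component(adj, start):
    """All units reachable from `start`, i.e. the network it belongs to."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


adj = build_adjacency(links)
print(sorted(component(adj, "H")))  # units in the same network as "H"
print(sorted(component(adj, "X")))  # a separate network
```

Each connected component obtained this way corresponds to one group network, with legal units as nodes and ownership links as edges.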
Examples of networks topologies of different Group Heads and their Legal Units, with igraph R package. Source: the author.
For each main group, there is a parent legal unit that assumes overall control, called the Global Group Head. The group holds a certain number of legal units and enterprises, given by the attributes GEG_N_LEU and GEG_N_ENT, respectively. Figure 2 illustrates the graphical representation of three different Global Group Head networks.
A list of the attributes used to perform SUWAN and subgroup discovery on attributed networks is presented below in Table 2, where some attributes are the result of a categorization. A LEU can assume different forms, such as limited liability company (LL), sole proprietor (SP), partnership (PA), government (GO), nonprofit body (NB), natural person (NP) or not defined (ND). The size of the Legal Unit is defined by the number of persons employed (LEU_PERS_EMPL), which includes the total number of persons who work in the observation unit, as well as persons who work outside the unit but belong to it and are paid by it, such as sales representatives. Moreover, persons who are absent for a short period or on strike, part-time and seasonal workers, apprentices, and home workers on the payroll are included in the count.
Description of relevant attributes of EGR database
The turnover class variable groups the turnover values of each Enterprise into 6 classes. An Enterprise always has an associated turnover, since an Enterprise is a Legal Unit producing economic goods and services. A Legal Unit, on the other hand, may not have an associated turnover; these cases correspond to the Legal Units that are not Enterprises. From the network point of view, each LEU represents a node, so some nodes will not have a defined turnover class but still belong to the network structure.
Regarding the composition of the variables presented in Table 2, a few were obtained by combining different variables from the original dataset. The turnover class variable was built from the enterprise turnover values, as described in Table 3, where a turnover class of 1 indicates a turnover of zero, while class 99 refers to turnover values that are not defined. The size class variable was built from the number of persons employed by the enterprise (Table 4), where a class of 99 indicates an undefined number of employees.
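The categorization of turnover can be sketched as a simple binning function. The thresholds below follow the class ranges mentioned in this paper (zero, below 2 million, 2 to 10 million, 10 to 50 million euros, and undefined). The treatment of the exact boundary values and the assignment of everything above 50 million euros to class 5 are assumptions, since the underlying table is not reproduced here.

```python
def turnover_class(turnover_eur):
    """Map an enterprise turnover in euros to a turnover class.

    Assumes non-negative turnover. Boundary handling (e.g. exactly 2 million)
    and the open-ended class 5 are assumptions made for illustration.
    """
    if turnover_eur is None:
        return 99  # turnover not defined (e.g. a Legal Unit that is not an Enterprise)
    if turnover_eur == 0:
        return 1
    if turnover_eur < 2_000_000:
        return 2
    if turnover_eur < 10_000_000:
        return 3
    if turnover_eur < 50_000_000:
        return 4
    return 5


# Hypothetical turnover values, in euros.
examples = [None, 0, 1_500_000, 5_000_000, 20_000_000, 90_000_000]
print([turnover_class(value) for value in examples])
```

The resulting class is what serves as the target label of each node; nodes mapped to class 99 remain part of the network structure without a defined turnover class.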
Turnover class based on the enterprise turnover values
Size of the enterprise based on number of persons employed
In this chapter, the described methodology (both SUWAN and Subgroup Discovery as a benchmark) is applied to the EuroGroups Register. The analysis is performed for 2018, the most recent year, which contains over 6870 network groups, even though not all of them are suitable for the analysis. A subset of the networks for 2018 [12] was selected based on a predefined criterion: the groups under analysis are the networks that have a UCI based in Portugal and at least 20 nodes. In total, SUWAN and subgroup discovery were applied to 67 networks, which contain a total of 3848 LEUs.
As previously described, clusters are obtained by associating nodes with the closest representative. Thus, the way clusters are formed depends on the weighted metric, which merges the node attribute distances and the network distances. This weight is defined by the parameter
The optimized value of
For each of the 67 networks under analysis, we calculated the proportion of explained pseudo-inertia between
For both methods, SUWAN and Subgroup Discovery (SD), a target variable is defined to obtain the class labels. For the EGR network, the focus is to form clusters of enterprises that have the same turnover class. The results on the Portuguese UCI networks show an average overall quality of 0.532 for SUWAN and 0.726 for SD (Table 5). This means that, on average, SD produced purer clusters than SUWAN. Concerning the number of clusters/subgroups, SD produced 3.8 subgroups on average, while SUWAN produced 3.3 clusters.
Summary table of average performance of SUWAN and SD methods
Although the SUWAN algorithm produced, on average, lower quality clusters, more than 11% of the networks achieved an overall quality higher than 70%. The results for these eight networks (38, 25, 48, 51, 32, 21, 41, and 50) are presented in Table 6 below, and a graphical representation of the networks is shown in Fig. 4.
SUWAN results on EGR network, with Portuguese UCI
Normalized proportion of explained pseudo-inertia for the distances between nodes attributes (
Graphical representation of SUWAN on Networks 38, 25, 48, 51, 32, 41, 50 and 21, with cluster identification. Source: the author.
Network number 38, composed of 24 LEUs, achieved the maximum overall quality of 1, meaning that the clusters obtained are all class-uniform. In this case, the network presents only four class labels, which correspond to the number of levels assumed by the target variable turnover class and to the number of associated clusters.
Attribute list of nodes from Network_ID 38
From Table 7, it can also be seen that this network presents the same type of Legal Unit (L) and the same legal form, Limited Liability (LL), for all observations. The majority of LEUs are from Portugal (PT), with the exception of one that is based in Spain (ES). Furthermore, cluster 1 is composed of two LEUs that belong to turnover class 4, which ranges between 10 and 50 million euros. Cluster 2 corresponds to turnover class 3, with a range between 2 and 10 million euros. Cluster 3 corresponds to turnover class 1, that is, a turnover of zero. This cluster is formed by LEUs with the lowest size class and with the majority of economic activity related to financial service activities (K.64). Lastly, cluster 4 corresponds to turnover class 2, with less than 2 million euros of turnover. This could be explained by the fact that most LEUs of this cluster dedicate their economic activity to agriculture and have the lowest size class, corresponding to 0–1 persons employed.
For network number 25, the SUWAN algorithm grouped the 39 nodes into five clusters, with an overall quality of 0.923. The first cluster has a majority turnover class of 1, with 12 cases. This cluster contains LEUs of type L and form LL. Apart from one case, in Morocco, all LEUs in this cluster are in Portugal. The size class of the cluster is spread over classes 1 and 2, with one observation belonging to class 3. The economic activities of this cluster vary between agriculture and financial service activities. Clusters 2 and 5, with 7 and 14 nodes, respectively, are pure clusters with turnover class 2. These clusters only contain LEUs with the attributes type L, form LL, country PT, size class 2 and NACE div A.01. Cluster 3 contains a single node, which corresponds to the only observation with a turnover class of 4. Finally, cluster 4, with 5 nodes, is also a pure cluster, of turnover class 3. This cluster contains LEUs of types L and B, form LL, countries Portugal and Spain, size classes 4 and 3, and economic activities of agriculture and manufacture of food products. Network number 48 resulted in 3 clusters, where cluster 1 contains turnover classes 1 and 2, while clusters 2 and 3 are pure, with turnover classes 2 and 4, respectively. The clustering of network number 51 also produced mostly class-uniform clusters, except for cluster 4, which contains turnover classes 1, 5 and 99.
We used Subgroup Discovery (SD) as a benchmark for SUWAN. The results obtained with SD method, for the same set of networks, are presented below in Table 8. With this algorithm, there are several differences to highlight, like the number of nodes used in each network, that differs from the ones obtained from SUWAN. The reason behind this difference lies in the fact that SD does not group all nodes in clusters, but instead, it finds subgroups of nodes with an associated description. Besides that, the algorithm allows an overlapping of nodes between subgroups, that affects the number of unique nodes used on the SD task.
Subgroup discovery results on EGR network, with Portuguese UCI
This way, for network number 38, SD generated 4 subgroups, where from the 24 existing nodes, only 5 were grouped. In this specific case, the same five nodes were grouped in four different subgroups, with different descriptions. In fact, the subgroups found are subgroups of each other, with more specific descriptions for the same set of nodes (Table 9). The overall quality of this network with subgroup discovery is 80% since there is only one node with a different class label between each subgroup.
Subgroup discovery output for Network_ID 38
For the other networks presented in Table 8, subgroup discovery produced outcomes that differ from the one described above. In the case of network number 50, it produced a single subgroup with 29 observations, described by the economic activity of the LEUs, which corresponds to the Human and Health activities. The target class of this subgroup takes the values 2, 3, 4 and 99, with the majority of observations belonging to class 2.
Graphical representation of Subgroup Discovery on network 48, with colour identification of the subgroups. Source: the author.
On the other hand, network 48 (Fig. 5), has grouped 20 of the nodes, with an overlapping of 37.5%. These subgroups are described based on the size class and country code. The first subgroup of size 6, contains observations with turnover class of 1 and 2, and it is described by a size class of 2. The second subgroup belongs to the first one, with the addition of the country code (PT) in description. The last subgroup, with 20 nodes, is described by the country code (PT). In this case, the target class varies among the values 1, 2 and 4. Figure 5 shows the three subgroups in the same network, separately, due to the overlapping of nodes in the different subgroups.
The two methodologies differ in the time they need to learn the clusters. SUWAN proved to require more computational time than SD. The algorithm implemented by SUWAN searches the solution space to find an optimal solution that maximizes the quality function. It goes through an iterative process in which single non-representatives are added to and representatives are removed from the current solution, which can take more or less time depending on the size of the network. The computational time of SUWAN is therefore strongly related to the size of the network, in contrast to SD, which is not affected by it.
As a complementary task, it was also possible to observe that the variables with the highest impact on the LEU’s turnover class are the size class and the economic activity of the group (NACE Div). For the NACE Div attribute, it is possible to infer that, for turnovers of less than 2 million euros (classes 1 and 2), the most frequent activities are financial services (K.64), real estate (L.68) and professional, scientific and technical activities (M.70). On the other hand, for the higher turnover classes 4 and 5, with more than 10 million euros, the most frequent activities are in the fields of wholesale and retail trade and repair of motor vehicles and motorcycles (G.46, G.45 and G.47), electricity, gas, steam, and air conditioning supply (D.35), manufacturing (C.10, C.16) and construction (F.42 and F.41). For the size class, it was possible to observe that the majority of LEUs with the lower turnover classes 1 and 2 are concentrated in the lower size classes 1 and 2. Conversely, LEUs with higher turnover classes have more observations in size classes 5 and 6.
Due to the selection of networks with a Portuguese UCI, the variables concerning the type and legal form of LEUs proved to be irrelevant for SUWAN, since there are few cases where the attributes type and form differ from L and LL, respectively.
Network topology impact on performance
To analyze the impact of the network’s topology on the group’s performance, a set of network topology measures and the group’s total turnover were examined. The measures describe the network’s compactness, centrality, and density (Table 10).
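As an illustration of how such measures can be computed, the following Python sketch derives the diameter, average degree, density, and average closeness of a small undirected graph. It is a sketch under stated assumptions, not the procedure used in this work: the graph is assumed to be connected with at least two nodes, closeness is taken as (n − 1) divided by the sum of shortest-path distances, the function names are hypothetical, and betweenness is omitted for brevity.

```python
from collections import deque

def bfs_distances(adj, source):
    # Hop distances from source to every reachable node.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def topology_summary(adj):
    # adj maps each node to the set of its neighbours (undirected, connected graph).
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2  # every edge is counted twice
    all_dist = {v: bfs_distances(adj, v) for v in adj}
    diameter = max(max(d.values()) for d in all_dist.values())
    closeness = [(n - 1) / sum(d.values()) for d in all_dist.values()]
    return {
        "diameter": diameter,
        "average_degree": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
        "average_closeness": sum(closeness) / n,
    }
```

For a star with one central node and three leaves, `topology_summary({0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}})` gives a diameter of 2, an average degree of 1.5, and a density of 0.5.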
Summary on essential network topology measures used to study the impact on the organizational performance
Correlation plot between the variables diameter, average degree, average closeness, average betweenness, density and total turnover. Source: the author.
The group performance can be measured by its turnover. Hence, a new variable, sum_ent_turnov, was computed as the sum of the turnovers of the enterprises that belong to the group. The correlations between variables in Fig. 6 show that the pairs average closeness/density and diameter/average betweenness are positively correlated, with correlation values of 0.93 and 0.88, respectively. On the other hand, with a negative correlation value of
In the multiple linear regression output of Table 11, the variable with the largest impact on the group’s turnover is the average closeness, with an estimated coefficient of 7.524e11, although its p-value is not significant. Likewise, none of the other variables proved to be significant in the model. Looking at the F-statistics and the overall
Multiple linear regression model with network topology measures as independent variables and total turnover as dependent variable
The approach of using both the network structure and the node attributes in the clustering process proved to be feasible. It enabled the creation of clusters/subgroups that are not only densely linked but also class-uniform with respect to the target class that describes their vertices. A characteristic of interest is defined beforehand as a target class, which makes it possible to obtain clusters/subgroups based on a class label.
The application of supervised clustering on attributed networks seems to be a promising technique. Atzmueller [30] applied the alternative Subgroup Discovery methodology to attributed social networks. For this purpose, the principles of Subgroup Discovery were adapted to the dyadic network setting, detecting compositional patterns and capturing subgroups of nodes, assessed by a quality measure. Subgroup Discovery was implemented on the EGR networks with the SD-MAP algorithm, using the preprocessing of the COMODO algorithm, which combines the graph structure and the descriptive information of the vertices.
On the supervised clustering side, the SRIDHCR algorithm proposed by Zeidat and Eick [9] was adapted to consider both structural and compositional characteristics of the EGR network. Moreover, the original algorithm was also adapted to categorical variables through a variation of k-means known as k-modes.
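To make the k-modes idea concrete, the short Python sketch below shows its two usual building blocks for categorical data: a simple matching dissimilarity and an attribute-wise mode that plays the role of the cluster centre. The function names and the example records are hypothetical, and the sketch does not reproduce the adaptation used in SUWAN.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    # Number of attributes on which two categorical records differ.
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    # Attribute-wise most frequent category; ties go to the first value seen.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Illustrative records of the form (type, legal form, NACE division).
records = [("L", "LL", "G.46"), ("L", "LL", "K.64"), ("B", "LL", "G.46")]
mode = cluster_mode(records)
```

Replacing the mean with the mode and the Euclidean distance with a mismatch count is what allows a k-means-style procedure to operate on attributes such as legal form or economic activity.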
In a preliminary analysis of the outputs produced by both methodologies, it was concluded that Subgroup Discovery produced better clusters/subgroups, with higher overall quality, than SUWAN. However, Subgroup Discovery achieved these results partly because it leaves some nodes ungrouped and allows nodes to overlap between subgroups. Its main focus is to find subgroups of nodes that are described by patterns and evaluated by a given quality measure.
The SUWAN method also produced quite good results, with high cluster purity among the studied cases. Unlike Subgroup Discovery (SD), this method groups all nodes of the network into clusters.
Regarding computational time, SUWAN proved to be less efficient in presenting the solution. This is due to the iterative process that the SUWAN algorithm uses when searching the solution space, which is highly influenced by the size of the network. SD, on the other hand, is not affected by the size of the network.
The focus of the work was to obtain class-uniform clusters, based on the EuroGroups LEUs turnover class, using SUWAN. The analysis of the results revealed certain patterns in the nodes that compose the clusters. Clusters whose majority turnover class ranges between 1 and 2 are formed by LEUs that employ fewer persons. Similarly, clusters with class labels of turnover ranging between 5 and 6 are composed of legal units with higher size classes. Therefore, the turnover is clearly affected by the size of the legal unit.
Additionally, the analysis of the impact of network topology on performance found no significant evidence of a relationship between the total turnover variable and the network topology measures of diameter, average degree, closeness and betweenness.
Furthermore, this study revealed that SUWAN involves certain challenges when applied to attributed networks. One of the challenges is the parameterization of variables that influence the clustering output. For example, the importance of the network topology, established by
As future work, we aim to develop new evaluation methods based on a better trade-off between network topology and node characteristics. In addition, since this paper dealt with undirected networks, it is also important to develop measures for directed networks.
