SUWAN: A supervised clustering algorithm with attributed networks

Abstract

A growing area of study for economists and sociologists is the variation in organizational structures across business networks. Network science makes it possible to identify the determinants of the performance of these business networks. In this work we look for the determinants of inter-firm performance. On the one hand, we propose SUWAN, a new method of supervised clustering with attributed networks, which aims at obtaining class-uniform clusters of the turnover while minimizing the number of clusters. The method performs representative-based supervised clustering, where a set of initial representatives is randomly chosen. One of the innovative aspects of SUWAN is that it applies a supervised clustering algorithm to attributed networks by weighting the matrix of distances between nodes against the matrix of distances between their attributes when defining the clusters. As a benchmark, we use Subgroup Discovery on attributed network data. Subgroup Discovery focuses on detecting subgroups described by specific patterns that are interesting with respect to some target concept and a set of explaining features. On the other hand, in order to analyze the impact of the network’s topology on the group’s performance, several network topology measures and the group total turnover were analyzed. The proposed methodologies are applied to an inter-organizational network, the EuroGroups Register, a central register that contains statistical information on business networks from European countries.

Keywords

SUWAN supervised clustering attributed networks subgroup discovery

1. Introduction

The organizational structure of firms is in constant change. Organizations tend to adapt and change in order to gain a competitive advantage. Nevertheless, firms can also accomplish their goals through collaboration with other organizations. Indeed, it is assumed that organizations are influenced by their inter-organizational relationships [1]. Whether they are defined as strategic alliances, trade networks, joint ventures, or considered to be a result of the nature of the industry or local circumstances, these relationships are seen as a mode of economic cooperation [2]. An inter-organizational network represents the relationships between different organizations, where organizations are represented as vertices and their relationships as edges. Inter-firm or inter-organizational networks are important structures with a propensity for displaying community structure, meaning that groups of densely linked vertices that are poorly connected to other groups of vertices may be revealed [3]. According to Harenberg et al. [4], community detection consists of detecting groups of densely connected nodes that typically have fewer connections to nodes outside the group. Therefore, the aim of community detection algorithms is to find cohesive subgraphs of nodes that can be representative of a community, focusing on the structural aspects of the network. A remarkable contribution to the field was made by Newman, who developed an algorithm that optimizes a quality function known as modularity over the possible divisions of a network [5]. Clustering, on the other hand, studies a set of elements by splitting it into smaller groups with similar characteristics, with a focus on compositional characteristics, i.e., characteristics related to the attributes of the nodes. Clustering algorithms aim at finding patterns in data points based on their similarity. A cluster is a group of data points that are similar to each other based on their relation to surrounding data points. For this approach, two main methodologies are highlighted: hierarchical and partitional clustering algorithms. Partitional clustering methods discover all clusters concurrently and do not enforce a hierarchical structure, whereas hierarchical clustering algorithms find nested partitions iteratively [6].

Hoberecht et al. [7] state that organizational networks aim to improve performance. According to the authors, there are many reasons for establishing a network, for instance when organizations want to achieve a specific goal that is shared by another organization. Moreover, the corporate community invests in inter-organizational networks to maximize supply chain efficiency and profitability. Matous and Todo [8] studied the impact of network topology and diffusion on Japanese automobile production networks, revealing the coevolution of the reorganization of inter-organizational networks and organizational performance. Indeed, one important goal of the study of inter-firm networks (IFN) is to analyze the impact of network topology measures on performance measures, such as a company’s turnover.

This work is developed in two different and complementary ways. Firstly, we propose SUWAN, the SUpervised clustering With Attributed Networks, a new method of supervised clustering that can be used to detect important patterns in the turnover of inter-organizational networks, consisting of clustering groups of nodes, considering both the structural characteristics (topology of the network, based on relationships between nodes), and compositional characteristics (related to nodes’ attributes). SUWAN is based on the Single Representative Insertion/Deletion Hill Climbing with Restart (SRIDHCR) algorithm from Zeidat and Eick [9]. As a benchmark, subgroup discovery is used to detect and identify relevant network patterns [10]. The goal is to discover interesting associations among different variables with respect to a property of interest. In contrast with standard community detection methods, SUWAN assembles elements of a graph, based on their structural and compositional characteristics, while it provides class-uniform clusters, based on a predefined target variable. The reason why we have chosen Subgroup Discovery as a benchmark is that it provides a description and identification of communities based on the combination of their features.

From a supervised clustering perspective, some authors argue that classical clustering techniques do not guarantee that objects of the same class are grouped together, e.g., [11]. The proposed method follows the methodology of supervised clustering algorithms, which deal with this limitation by improving a measure of cluster purity. Furthermore, SUWAN tackles the unexplored field of applying supervised clustering to attributed networks. Another contribution of this work is the exploration of supervised clustering using a benchmark with subgroup discovery on inter-organizational networks. Indeed, classical community detection techniques focus only on finding subgroups of nodes with a dense structure, lacking an interpretable description. Subgroup Discovery can deal with description-oriented community detection. Moreover, this approach can also provide insights beyond connectivity within communities, and into the relationships between subgroups of nodes as well.

Complementarily, in order to analyze the impact of the network’s topology on the group’s performance, several network topology measures and the group total turnover were analyzed. To study the network topology, measures of compactness, centrality, and density were examined, and a multiple regression model was applied to the networks under analysis, with the total turnover as the dependent variable and the network topology measures as the independent variables. Data from the EuroGroups Register of Eurostat [12] is used to illustrate both approaches.

The paper is structured as follows: related work is presented in Section 2. SUWAN, the supervised clustering algorithm with attributed networks, is introduced in Section 3. In Section 4, the described methodology is applied to the EuroGroups Register. Finally, in Section 5, conclusions and new challenges are discussed.

2. Related work

Clustering is a methodology that groups data according to a desired criterion, making it possible to find structure in a dataset [13]. The main goal of cluster analysis is thus to group a set of items into homogeneous groups, also referred to as clusters, using a predetermined measure of similarity. When clustering is performed, the similarity within groups should be higher than the similarity between different groups [14].

Beyond the structural form, networks may contain additional information, which can be related to the entities and their relationships [15]. In attributed networks, nodes and/or edges are labeled with additional information, providing further dimensions for detecting patterns that describe a specific subset of nodes of the network. According to Vieira, Campos and Brito [16], community detection and classic clustering techniques consider only the structural information of networks. The authors also point to the use of additional information about node attributes to enhance the output of the algorithms, by combining information about the network architecture with the properties describing the vertices.

Unlike traditional clustering, which works with unlabeled data, supervised clustering assumes that the items are classified. The objective of supervised clustering is to find class-uniform groups of items that have a high probability density with respect to a single class [17]. The problem of supervised clustering may be presented as a pair ( $X, C$ ), where $X$ denotes a finite set of items $X=\{x_{1},\ldots,x_{n}\}$ and $C$ denotes a collection of distinct and nonempty subsets $C_{1},\ldots,C_{k}$ of $X$. As in traditional clustering, a distance function can be formalized as $\textit{dist}(x,y)$ , where $x$ and $y$ represent two distinct data points. Depending on the type of data, different functions can be used to measure the distance between data points. For numerical data, the most commonly used are the Euclidean distance, $\textit{dE}$ , and the Manhattan distance, $\textit{dM}$ (Eq. (1)). For nominal attributes, the most used measure is based on simple matching and is known as the Hamming distance (also in Eq. (1)), where $\textit{dH}(x,y)$ corresponds to the number of positions where $x$ and $y$ differ [18].

$\displaystyle\textit{dE}(x,y)=\sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^{2}}\quad\textit{dM}(x,y)=\sum_{i=1}^{n}|x_{i}-y_{i}|\quad\textit{dH}(x,y)=|\{i:x_{i}\neq y_{i}\}|$ (1)
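As an illustration, the three distances named above can be sketched in Python using their standard component-wise definitions. The function names and the example vectors are ours and are not taken from the original implementation.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance: sum of absolute component-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Hamming distance: number of positions where x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


# Hypothetical examples: two numeric points and two nominal attribute vectors.
d_e = euclidean([0, 0], [3, 4])
d_m = manhattan([0, 0], [3, 4])
d_h = hamming(["LL", "PT", "3"], ["LL", "ES", "5"])
```

The Hamming distance is the only one of the three that applies directly to nominal attributes such as legal form or country code.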

Zeidat and Eick [9] presented k-medoid-style algorithms for supervised clustering. The k-medoid model searches for $k$ representative objects, known as medoids, that minimize the average dissimilarity of all the data set’s objects to the closest medoid [19]. The clusters are then formed by the groups of objects assigned to the same medoid. One of the proposed algorithms based on the k-medoid model is Supervised Partitioning Around Medoids (SPAM). SPAM uses a fitness function instead of the dissimilarity measure, and takes the number of clusters $k$ as an input parameter. The fitness function is presented in Eq. (2), where $n$ represents the total number of examples, $c$ the number of classes and $\beta$ the penalty associated with the number of clusters $k$ . The class impurity measures the percentage of minority examples in the different clusters of a clustering. The goal is to minimize $q(X)$ , to obtain class-uniform clusters while keeping the number of clusters small.

$\displaystyle{q}({X})=\textit{Impurity}({X})+\beta\textit{Penalty}({k})$ (2)

Where

$\displaystyle\textit{Impurity}({X})=\frac{\textit{\# of Minority Examples}}{{n}}$

and

$\displaystyle\textit{Penalty}(k)=\begin{cases}\sqrt{\frac{k-c}{n}},&k\geqslant c\\0,&k<c\end{cases}$
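A minimal sketch of this fitness function is given below, assuming that a clustering is represented by a list of class labels and a parallel list of cluster identifiers (this representation and all function names are our own choices for illustration). Minority examples are counted as the members of a cluster that do not belong to its most frequent class.

```python
import math
from collections import Counter


def impurity(labels, assignment):
    """Fraction of examples that are not in the majority class of their cluster."""
    clusters = {}
    for label, cluster in zip(labels, assignment):
        clusters.setdefault(cluster, []).append(label)
    minority = sum(len(m) - max(Counter(m).values()) for m in clusters.values())
    return minority / len(labels)


def penalty(k, c, n):
    """Penalty for using more clusters (k) than classes (c) on n examples."""
    return math.sqrt((k - c) / n) if k >= c else 0.0


def fitness(labels, assignment, beta):
    """q(X) = Impurity(X) + beta * Penalty(k), to be minimized."""
    n = len(labels)
    k = len(set(assignment))
    c = len(set(labels))
    return impurity(labels, assignment) + beta * penalty(k, c, n)


# Hypothetical data: six examples, two classes.
labels = ["a", "a", "b", "b", "b", "a"]
q_two_clusters = fitness(labels, [0, 0, 0, 1, 1, 1], beta=0.1)   # impure, no penalty
q_four_clusters = fitness(labels, [0, 0, 1, 2, 2, 3], beta=0.1)  # pure, penalized
```

In the example, the four-cluster solution is pure but pays a penalty for exceeding the number of classes, whereas the two-cluster solution pays no penalty but leaves two of six examples in the minority.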

This way, the algorithm builds an initial solution, using as representative objects the members of the most frequent class in the data set. Then, the algorithm repeatedly and greedily adds to the current set of representatives the non-representative object that yields the lowest value of the fitness function $q(X)$ . Another algorithm proposed by the authors is the Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Starting (SRIDHCR). Like SPAM, this method tries to minimize the quality function $q(X)$ . It starts by randomly selecting a number of objects as an initial set of representatives. Then, it greedily searches for better solutions by inserting single items into, and deleting single items from, the existing collection of cluster representatives. The major difference is that the number of clusters $k$ is not fixed, as the algorithm searches for an optimized number of clusters.

Similar work was developed by Gan et al. [20], who employed a novel graph-based classification method called Supervised clustering-based Regularized Least Squares Classification (SuperRLSC). The proposed methodology is based on the idea that supervised clustering may uncover more genuine data structures than traditional clustering. A supervised k-means algorithm is therefore used to partition the dataset into meaningful clusters. As in Zeidat and Eick [9], a fitness function is designed that favors homogeneous clusters while minimizing the number of clusters.

Finley and Joachims [21] present a method of supervised clustering with the k-means algorithm. This work is based on the idea that, in order to successfully implement k-means, a similarity measure that reflects the properties of the clusters must be chosen carefully. Hence, a Structural Support Vector Machine (SSVM, a generalization of an SVM) is used to perform k-means as a supervised task. The SSVM learns a parameterized distance measure such that k-means produces the preferred clusters and maximizes cluster accuracy. The similarity measure is thus learned from training examples of item sets with their correct clustering, so that future sets of items are grouped similarly.

In the same line of thought, Al-Harbi and Rayward-Smith [11] propose a new method of adapting k-means for supervised clustering. The authors argue that the traditional k-means algorithm for unsupervised clustering does not guarantee that objects of the same class are grouped together. Therefore, they propose an adaptation of k-means as a classifier clustering algorithm. The proposed method attempts to place the objects that have the same label into the same cluster by modifying strategic steps of the traditional k-means algorithm, namely the Euclidean metric and the objective function. The Euclidean metric used in k-means is transformed into a weighted Euclidean metric measuring the distance between $x$ and $y$ (Eq. (3)), which partitions the data according to the different labels by assigning a greater weight to the fields that have a more significant relationship with the class labels. Choosing the appropriate set of weights can be seen as an optimization problem that can be addressed by any metaheuristic technique. In this case, Simulated Annealing is used to find the best set of weights for the clustering problem. The goal is to make the k-means algorithm’s divisions as confident as possible.

$\displaystyle\delta(x,y)=\sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^{2}}\quad\delta_{w}(x,y)=\sqrt{\sum_{i=1}^{n}w_{i}(x_{i}-y_{i})^{2}}$ (3)

The use of supervised clustering on attributed networks is rather new. The possibility of combining both the structural (topological) and compositional (attribute based) characteristics is one of the advantages of our proposal. Indeed, this is one of the limitations of the existing methods. For example, Ji et al. [22] already address a solution for this problem within the field of heterogeneous information networks. The authors use a co-clustering approach of heterogeneous nodes based on constrained orthogonal non-negative matrix tri-factorization. However, these methods are applied to networks containing different types of nodes, while we consider a single type of nodes.

Self-Organizing Maps (SOM) [23] also combine the topology and similarity of nodes. However, SOM uses a different concept of graph structure: in SOM a data item is mapped into the node whose model is most similar to the data item, and the whole set can be regarded as constituting a similarity graph. In SOM, a graph is computed using only the similarity (compositional/attribute) characteristics of the items, and not the structural ones. In some related work, and also in the approach we follow, graphs are also based on the structural relationships between the items. As an example, a company can own another company (parental relationship) and several companies can be owned by the same company.

A methodology for description-oriented community detection has been proposed by Atzmueller, Doerfel and Mitzlaff [24]. This view is based on the fact that classical community detection techniques focus only on finding subgroups of nodes with a dense structure, lacking an interpretable description. Here, Subgroup Discovery, a data mining technique that focuses on discovering interesting relationships between different objects [25], can provide insights beyond connectivity within communities, and into the relationships between subgroups of nodes as well. Subgroup Discovery focuses on detecting subgroups described by specific patterns that are interesting with respect to some target concept and a set of explaining features. Therefore, interesting patterns among subgroups can be revealed, for example, by inductive and exploratory data analysis tasks that find relations between a dependent and (several) independent variables [26], considering the compositional aspect of the networks. With the additional information supplied by attributed networks, Subgroup Discovery can thus be applied to combine both structural and compositional characteristics of the network.

In fact, Atzmueller and Lemmerich [27] first proposed a subgroup discovery algorithm, COMODO, which is an adaptation of the SD-Map algorithm. SD-Map is an exhaustive subgroup discovery algorithm that adapts the Frequent Pattern Growth method (FP-growth) to the subgroup discovery task. The FP-growth algorithm uses a compressed tree representation of the itemset database. The tree grows by tracking each itemset and mapping it to a path. A quality function is used to assess a subgroup description and rate the identified subgroups during search, given a certain target variable [27]. Examples of quality functions are given in Eq. (4), where $n$ indicates the size of the subgroup, $N$ the population size, and $p$ and $p_{0}$ are the relative frequencies of the target variable in the subgroup and in the total population, respectively. According to Atzmueller [26], the quality functions for binary and nominal targets are based on the share of the target concept in the subgroup and in the overall population. Therefore, any of the quality functions presented in Eq. (4) would be suitable for the analysis. We carried out the analysis using the SD-Map algorithm with the default Piatetsky-Shapiro quality function, $q_{PS}$ .

$\displaystyle q_{\textit{WRACC}}=\frac{n}{N}(p-p_{0})\quad q_{\textit{PS}}=n(p-p_{0})\quad q_{\textit{LIFT}}=\frac{p}{p_{0}}$ (4)
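To make the three quality functions concrete, the following sketch computes them from raw counts. The function name, its arguments and the example counts are hypothetical and serve only as an illustration; they are not part of SD-Map.

```python
def subgroup_quality(n, n_pos, N, N_pos):
    """Return (WRAcc, Piatetsky-Shapiro, Lift) for one subgroup.

    n, n_pos: subgroup size and number of target-positive cases in it.
    N, N_pos: population size and number of target-positive cases in it.
    """
    p = n_pos / n    # target share in the subgroup
    p0 = N_pos / N   # target share in the whole population
    wracc = (n / N) * (p - p0)
    ps = n * (p - p0)
    lift = p / p0
    return wracc, ps, lift


# Hypothetical subgroup: 20 of 100 nodes, 10 of them in the target class,
# while 25 of all 100 nodes are in the target class.
wracc, ps, lift = subgroup_quality(n=20, n_pos=10, N=100, N_pos=25)
```

Note that WRAcc equals the Piatetsky-Shapiro value divided by the population size $N$, so for a fixed population both rank subgroups identically.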

We will use Subgroup Discovery methodology on attributed social networks as a benchmark for our method. One of the advantages of Subgroup Discovery is that it can deal with description-oriented community detection. However, Subgroup Discovery allows an overlapping of nodes between subgroups, which is not suitable when we want a partition where nodes should not overlap.

3. SUWAN and the EuroGroups Register

SUWAN is a SUpervised Clustering Algorithm for (With) Attributed Networks. This new methodology employs the SRIDHCR algorithm [9], which consists of representative-based supervised clustering with the addition of a quality function that assesses cluster purity and quantity. SUWAN allows clustering groups of nodes (of one single type, as in homogeneous networks), considering both the structural and compositional characteristics of the network. Additionally, it is suitable for either categorical or numerical attributes. For this purpose, a fitness function is defined, as described in Eq. (5). The input parameters are the target variable, the penalty $\beta$ associated with the number of clusters, and the weight $\alpha$ of the network distances. The number of classes, $t$ , is given by the number of unique values that the target variable can assume. The target variable should be selected to reflect a characteristic of interest when forming clusters.

$\displaystyle Q(x)=\textit{Impurity}(x)+\beta\times\textit{Penalty}({k})$ (5)

The pseudocode for the SUWAN algorithm is presented in Fig. 1. The first step of the algorithm concerns the application to attributed networks, which considers both structural and compositional characteristics of the network. To this end, two matrices are computed: one that measures the distances between node attributes, $D_{1}$ , and another that measures the distances between all nodes in the network, $D_{2}$ . From these, a weighted matrix is obtained, with a weighting defined by $\alpha$ , which determines the weight given to the distances between the nodes on the network. The SRIDHCR algorithm [9] is based on k-means, which uses a Euclidean metric. Consequently, it is suitable only for numeric variables. To work around this limitation, a dissimilarity measure for categorical variables, the Hamming distance (Eq. (1)), is incorporated in the SUWAN algorithm. This measure is used to obtain the distances between node attributes ( $D_{1}$ ). Hence, SUWAN employs either the k-modes or the k-means algorithm, according to the type of data.
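The construction of the weighted matrix can be sketched as follows. The sketch assumes a convex combination $(1-\alpha)D_{1}+\alpha D_{2}$, hop-count shortest paths for $D_{2}$, a connected network, and a rescaling of both matrices to $[0,1]$ before mixing; these choices and all names are our assumptions for illustration and are not confirmed details of the SUWAN implementation.

```python
from collections import deque


def hamming_matrix(attrs):
    """D1: pairwise Hamming distances between categorical attribute tuples."""
    n = len(attrs)
    return [[sum(a != b for a, b in zip(attrs[i], attrs[j])) for j in range(n)]
            for i in range(n)]


def shortest_path_matrix(n, edges):
    """D2: hop-count distances on an undirected, connected graph (BFS per node)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[float("inf")] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] == float("inf"):
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def scale(matrix):
    """Rescale to [0, 1] so both matrices are comparable (assumption)."""
    top = max(max(row) for row in matrix)
    return [[v / top for v in row] for row in matrix] if top > 0 else matrix


def combine(d1, d2, alpha):
    """Weighted matrix: alpha is the weight given to the network distances."""
    d1, d2 = scale(d1), scale(d2)
    n = len(d1)
    return [[(1 - alpha) * d1[i][j] + alpha * d2[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical 4-node ownership chain with (legal form, country) attributes.
attrs = [("LL", "PT"), ("LL", "PT"), ("SP", "PT"), ("SP", "ES")]
edges = [(0, 1), (1, 2), (2, 3)]
d1 = hamming_matrix(attrs)
d2 = shortest_path_matrix(4, edges)
d = combine(d1, d2, alpha=0.7)
```

With $\alpha=0$ the combined matrix reduces to the attribute distances, and with $\alpha=1$ to the network distances.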

As in the SRIDHCR algorithm, the proposed method starts by randomly selecting a set of initial representatives, denoted curr (Step 2.1). The number of elements contained in this solution defines the number of clusters $k$ . As previously mentioned, the goal is to minimize $Q(x)$ , to obtain class-uniform clusters while minimizing the number of clusters. Therefore, $k$ is not fixed, as the algorithm searches for an optimized number of clusters. Clusters are obtained by assigning each node to its closest representative according to the weighted matrix.

The algorithm then generates new candidate solutions, $s$ , by adding a single non-representative node to the current solution or removing a single representative from it, keeping the candidate that improves the quality function. The algorithm terminates when no candidate improves $Q(x)$ . Nevertheless, the algorithm can still keep iterating without improving $Q(x)$ as long as the number of clusters is reduced (Step 2.2.4).
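The search described above can be sketched as follows. This is a simplified reading of the prose, not a transcription of Fig. 1: it takes a precomputed weighted distance matrix, omits the rule of Step 2.2.4 that allows non-improving moves which reduce the number of clusters, and uses names and default parameters (`hill_climb`, `suwan_sketch`, `k_init`, `restarts`) that are our own.

```python
import math
import random
from collections import Counter


def assign(dist, reps):
    """Assign every node to its closest representative (ties: first in reps)."""
    return [min(reps, key=lambda r: dist[i][r]) for i in range(len(dist))]


def quality(dist, labels, reps, beta):
    """Q = impurity + beta * penalty for the clustering induced by reps."""
    n, c, k = len(labels), len(set(labels)), len(reps)
    members = {}
    for node, rep in enumerate(assign(dist, reps)):
        members.setdefault(rep, []).append(labels[node])
    minority = sum(len(m) - max(Counter(m).values()) for m in members.values())
    penalty = math.sqrt((k - c) / n) if k >= c else 0.0
    return minority / n + beta * penalty


def hill_climb(dist, labels, beta, k_init, rng):
    """One run: greedy single insertions/deletions until Q stops improving."""
    n = len(labels)
    curr = rng.sample(range(n), k_init)
    best = quality(dist, labels, curr, beta)
    while True:
        candidates = [curr + [v] for v in range(n) if v not in curr]
        if len(curr) > 1:
            candidates += [[r for r in curr if r != v] for v in curr]
        if not candidates:
            return sorted(curr), best
        q, s = min(((quality(dist, labels, s, beta), s) for s in candidates),
                   key=lambda t: t[0])
        if q >= best:
            return sorted(curr), best
        curr, best = s, q


def suwan_sketch(dist, labels, beta, k_init=2, restarts=5, seed=0):
    """Repeat the hill climbing from random starts and keep the best run."""
    rng = random.Random(seed)
    runs = [hill_climb(dist, labels, beta, k_init, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])


# Hypothetical toy data: two tight groups of three nodes, one class per group.
labels = [1, 1, 1, 2, 2, 2]
dist = [[0.0 if i == j else (0.1 if labels[i] == labels[j] else 1.0)
         for j in range(6)] for i in range(6)]
reps, q = suwan_sketch(dist, labels, beta=0.1)
```

On this toy example the search ends with one representative per group, since that solution has zero impurity and no penalty.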

Figure 1.

Pseudocode for SUWAN algorithm. Source: the author.

3.1 Evaluation measures

Measuring the quality of the clustering is an important step of the method’s implementation since it enables comparison with other procedures. However, evaluating the quality of a clustering is challenging when the correct clusters are not known. Since the implemented methodology works with labeled data, cluster evaluation can be based on the purity of the clusters; therefore, an overall quality measure based on cluster purity was computed to assess the quality of the clustering.

Let the classes in the data set $A$ be $T=(t_{1},\ldots,t_{m})$ , and the set of clusters be $C=\{C_{1},\ldots,C_{K}\}$ . The clustering output is presented in a table format, with $K$ rows and $m$ columns, one row per cluster and one column per class. For each cluster, the purity is determined as presented in Eq. (6), where ${PR}_{k}(t_{i})$ is the proportion of class $t_{i}$ in cluster $C_{k}$ . The overall quality measure is then computed as the total purity of the whole clustering, given by Eq. (7), where $|C_{k}|$ is the total number of nodes in cluster $k$ , and $|C|$ the total number of nodes of the network.

$\displaystyle\textit{Purity}(C_{k})=\max_{i}({PR}_{k}(t_{i}))$ (6)

$\displaystyle\textit{Purity}_{\textit{total}}(C)=\sum_{k=1}^{K}\frac{|C_{k}|}{|C|}\times\textit{Purity}(C_{k})$ (7)
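A small sketch of both purity measures is given below, assuming the clustering is given as a list of class labels and a parallel list of cluster identifiers; the function names and the example data are illustrative only.

```python
from collections import Counter


def cluster_purity(cluster_labels):
    """Share of the most frequent class among the members of one cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def total_purity(labels, assignment):
    """Size-weighted average of the cluster purities over the whole clustering."""
    clusters = {}
    for label, cluster in zip(labels, assignment):
        clusters.setdefault(cluster, []).append(label)
    n = len(labels)
    return sum(len(m) / n * cluster_purity(m) for m in clusters.values())


# Hypothetical turnover classes of six nodes grouped into two clusters.
turnover = [3, 3, 3, 4, 4, 5]
clusters = [0, 0, 0, 0, 1, 1]
overall = total_purity(turnover, clusters)  # (4/6) * 0.75 + (2/6) * 0.5
```

The total purity equals the share of nodes that belong to the majority class of their cluster, here 4 of 6.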

3.2 Data: The EuroGroups Register

We apply SUWAN to recent information from the EuroGroups Register (EGR). EGR is a system of registers that includes a central register maintained by Eurostat as well as registers in each EU Member State and EFTA country [28]. Information regarding international company groups is kept in the central register, which keeps track of multinational corporations with statistically significant financial and non-financial transnational operations in at least one European country. The EGR database is composed of several distinct groups of firms that form multinational corporations. Those multinational groups can be seen as inter-organizational networks, making the EGR database suitable for network analysis.

In order to fully understand how the EGR network is formed, some basic concepts, such as Multinational Enterprise Group, Legal Unit, Enterprise, Global Group Head and Global Decision Center, must be introduced. A summary of these concepts is presented in Table 1.

Table 1
Summary of basic concepts of the EGR database

Concept	Definition
Multinational Enterprise Group	Enterprise group that has at least two enterprises or legal units. Can be domestic or foreign controlled
Legal Unit (LEU)	Individuals or institutions legally recognized by law or that are engaged in an economic activity
Enterprise (ENT)	Legal Unit producing economic goods and services
Global Group Head (GGH)	Parent legal unit of an enterprise that is not controlled by any other legal unit. Unit on top of the control chain of the group
Global Decision Center / Ultimate Controlling Institutional Unit	Unit where strategic decisions are taken. The goal is to produce meaningful statistics

Legal Units (LEUs) are not always represented in the same way by the sources used for statistical business registrations. These units might differ across nations and between various sources within a country. As a result, the LEU is ineffective as a statistical unit, especially in international comparisons. The Legal Unit is always the foundation for the statistical unit known as the “enterprise”, either alone or in conjunction with other Legal Units.

The enterprise is the statistical unit at which information about its transactions is kept, including financial and balance-sheet accounts, and from which international transactions, an international investment position (if applicable), consolidated financial position, and net worth can be calculated.

The EGR database is accessed through relational tables that hold information on the Multinational Enterprise Group networks. Each table contains a primary key that identifies the groups, legal units, or enterprises, through the attributes GEG_EGR_ID, LEU_EGR_ID and ENT_EGR_ID, respectively. These variables also serve as secondary keys that identify the connections between tables. In this way, networks can be identified in EGR by establishing affiliation links between organizations, which can be perceived as a “parent” and “child” relationship between organizations.

Figure 2.

Examples of networks topologies of different Group Heads and their Legal Units, with igraph R package. Source: the author.

For each main group, there is a parent legal unit that assumes the overall control, called the Global Group Head. The group holds a certain number of legal units and enterprises, identified by the attributes GEG_N_LEU and GEG_N_ENT, respectively. Figure 2 illustrates the graphical representation of three different Global Group Head networks.

A list of attributes used to perform SUWAN and subgroup discovery on attributed networks is presented below in Table 2, where some attributes are a result of a categorization. A LEU can assume different forms, such as, limited liability company (LL), sole proprietor (SP), partnership (PA), government (GO), nonprofit body (NB), natural person (NP) or not defined (ND). The size of the Legal Unit is defined by the number of persons employed (LEU_PERS_EMPL), and it includes the total number of persons who work in the observation unit, as well as persons who work outside the unit but belong and are paid by it, such as sales representatives. Moreover, persons that are absent for a short period, on strike, part-time and seasonal workers, apprentices, and home workers on payroll are included in the counting.

Table 2

Description of relevant attributes of EGR database

Attribute	Description
LEU_LEID	ID of the Legal Unit
LEU_TYPE	List of type of Legal Unit (Branch or not)
LEU_LFORM	List of legal forms of Legal Units
LEU_COUNTRY_CODE	List of 2-digit ISO country codes
SIZE_CLASS	Size of the enterprise based on persons employed
TURNOVER_CLASS	Turnover class based on the enterprise turnover values
NACE_DIV	2-digit NACE Rev. 2 codes of the main economic activity

The turnover class variable groups the turnover values of each Enterprise into 6 classes. An Enterprise always has an associated turnover, since an Enterprise is a Legal Unit producing economic goods and services. A Legal Unit, on the other hand, may not have an associated turnover; this is the case for the Legal Units that are not Enterprises. From the network point of view, each LEU represents a node, so some nodes will not have a defined turnover class but still belong to the network structure.

Regarding the composition of the variables presented in Table 2, a few were obtained by combining different variables from the original dataset. The turnover class variable was built from the enterprise turnover values, as described in Table 3, where a turnover class of 1 indicates a turnover of zero, while class 99 refers to undefined turnover values. The size class variable was built from the number of persons employed by the enterprise (Table 4), where a class of 99 indicates an undefined number of employees.

Table 3

Turnover class based on the enterprise turnover values

Turnover class	Description
1	0
2	Less than 2 million
3	Between 2 and 10 million
4	Between 10 and 50 million
5	More than 50 million
99	Not defined

Table 4

Size of the enterprise based on number of persons employed

Size class	Description
1	0–1 persons employed
2	2–9 persons employed
3	10–19 persons employed
4	20–49 persons employed
5	50–249 persons employed
6	250 or more persons employed
99	Not defined
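The categorization in Tables 3 and 4 can be sketched as two small mapping functions. The tables do not state how the boundary values are treated or in which unit turnover is stored, so the sketch assumes turnover expressed in millions, lower bounds inclusive at 2 and 10, and a value of exactly 50 falling in class 4; `None` stands for an undefined value. The function names are hypothetical.

```python
def turnover_class(turnover_millions):
    """Map an enterprise turnover (in millions, assumed unit) to a Table 3 class."""
    if turnover_millions is None:
        return 99  # not defined
    if turnover_millions == 0:
        return 1
    if turnover_millions < 2:
        return 2
    if turnover_millions < 10:   # boundary handling is an assumption
        return 3
    if turnover_millions <= 50:  # "more than 50" starts class 5
        return 4
    return 5


def size_class(persons_employed):
    """Map the number of persons employed to a Table 4 class."""
    if persons_employed is None:
        return 99  # not defined
    for upper, cls in [(1, 1), (9, 2), (19, 3), (49, 4), (249, 5)]:
        if persons_employed <= upper:
            return cls
    return 6


# Hypothetical enterprise: 12.5 million turnover, 37 persons employed.
example = (turnover_class(12.5), size_class(37))
```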

4. Experimental results

In this section, the described methodology (both SUWAN and Subgroup Discovery as a benchmark) is applied to the EuroGroups Register. The analysis is performed for 2018, the most recent year, which contains over 6870 groups of networks, although not all of them are suitable for the analysis. A subset of the networks for 2018 [12] was selected based on a predefined criterion: the groups under analysis are the networks that have an Ultimate Controlling Institutional Unit (UCI) based in Portugal and at least 20 nodes. In total, SUWAN and subgroup discovery were applied to 67 networks, containing a total of 3848 LEUs.

As previously described, clusters are obtained by associating nodes to the closest representative. Thus, the way clusters are formed depends on the weighted metric, which merges the distances between node attributes and the network distances. The weighting is defined by the parameter $\alpha$ , which sets the importance of the network distances. When $\alpha$ is set to zero, clusters are formed based solely on the attributes, resulting in a significant dispersion of nodes from the same cluster across the network representation. Conversely, when only the network distances are considered ( $\alpha=1$ ), nodes are grouped by closeness and clusters can be visually distinguished from each other. To determine the optimized value for $\alpha$ , an analysis of the proportion of explained pseudo-inertia was performed. For this purpose, partitions were generated for a range of possible values of $\alpha$ and a given number of clusters $k$ , so as to properly balance the relative weights of the distances between node attributes, $D_{1}$ , and the distances between nodes, $D_{2}$ [29]. To achieve this, the quality criteria $Q_{1}$ and $Q_{2}$ of the partitions ${P}_{k}^{\alpha}$ obtained with varying values of $\alpha$ were plotted, as described in Eq. (8), where the numerator, ${W}_{\theta}({{P}_{k}^{\alpha}})$ , is the mixed pseudo within-cluster inertia based on the dissimilarity matrix $D_{\theta}$ , the denominator, ${W}_{\theta}({P_{1}})$ , is the total mixed pseudo-inertia based on the same matrix, and $P_{1}$ is the reference partition that gathers all observations in a single cluster and therefore does not depend on $\alpha$ . For $\theta=1$ , the criterion gives the proportion of the total pseudo-inertia explained by the partition based on the matrix of distances between node attributes ( $D_{1}$ ), and for $\theta=2$ , the proportion based on the matrix of distances between nodes ( $D_{2}$ ). The higher the value of the criterion ${Q}_{1}({{P}_{k}^{\alpha}})$ , the more homogeneous the partition $P_{k}^{\alpha}$ is from the point of view of the node attributes. In the same way, the higher the value of the criterion ${Q}_{2}({{P}_{k}^{\alpha}})$ , the more homogeneous the partition $P_{k}^{\alpha}$ is from the point of view of the node distances.

$\displaystyle{Q}_{\theta}({P}_{k}^{\alpha})=1-\frac{{W}_{\theta}({P}_{k}^{\alpha})}{{W}_{\theta}({P}_{1})},\quad\theta\in\{1,2\}$ (8)
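As a rough illustration of Eq. (8), the sketch below computes the proportion of explained pseudo-inertia of a partition from a dissimilarity matrix. It assumes unit observation weights and takes the pseudo-inertia of a cluster to be the sum of its squared pairwise dissimilarities divided by twice the cluster size. The function names and the toy data are hypothetical and are not taken from the ClustGeo package.

```python
def pseudo_inertia(dist, members):
    """Pseudo-inertia of one cluster, assuming unit weights:
    sum of squared dissimilarities over ordered pairs / (2 * cluster size)."""
    if not members:
        return 0.0
    total = sum(dist[i][j] ** 2 for i in members for j in members)
    return total / (2 * len(members))


def explained_inertia(dist, labels):
    """Q = 1 - W(partition) / W(single cluster), as in Eq. (8)."""
    clusters = {}
    for node, label in enumerate(labels):
        clusters.setdefault(label, []).append(node)
    within = sum(pseudo_inertia(dist, m) for m in clusters.values())
    total = pseudo_inertia(dist, list(range(len(labels))))
    return 1.0 - within / total


# Toy data (hypothetical): four points on a line, two obvious groups.
positions = [0, 1, 10, 11]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
print(explained_inertia(toy_dist, [0, 0, 1, 1]))
```

Applying the same function to $D_{1}$ and to $D_{2}$ for a fixed partition would give the two criteria $Q_{1}$ and $Q_{2}$ discussed above.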

The optimized value of $\alpha$ is the trade-off point between the loss of homogeneity in the node attribute distances and the gain of homogeneity in the node distances. Figure 3 gives an example of a plot of the normalized proportion of explained pseudo-inertia calculated with the matrix of attribute distances ($D_{1}$) and the matrix of node distances ($D_{2}$).

For each of the 67 networks under analysis, the proportion of explained pseudo-inertia for $D_{1}$ and $D_{2}$ was calculated with $\alpha$ ranging over [0, 1] and the number of clusters given by the number of class labels $t$ of the network, using the ClustGeo package in R. The average value of $\alpha$ obtained across the 67 networks was 0.651. Therefore, the following analyses are performed with a network distance weight of $\alpha=0.7$.
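One possible way to read the trade-off point from such curves is sketched below. This is an assumed heuristic, not the procedure documented in the source: each curve is normalized by its best attainable value (attributes at $\alpha=0$, network distances at $\alpha=1$), and the grid value where the two normalized curves are closest is returned. The function name and all numbers are hypothetical toy values.

```python
def crossing_alpha(alphas, q_attr, q_net):
    """Return the alpha where the normalized curves are closest.

    Assumes `alphas` is sorted from 0 to 1, `q_attr[0]` is the attribute
    criterion at alpha = 0 and `q_net[-1]` the network criterion at alpha = 1.
    """
    norm_attr = [q / q_attr[0] for q in q_attr]
    norm_net = [q / q_net[-1] for q in q_net]
    gaps = [abs(a - n) for a, n in zip(norm_attr, norm_net)]
    return alphas[gaps.index(min(gaps))]


# Hypothetical values, for illustration only (not results from the paper).
toy_alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_q_attr = [0.80, 0.76, 0.70, 0.56, 0.20]   # decreases as alpha grows
toy_q_net = [0.10, 0.30, 0.45, 0.55, 0.60]    # increases as alpha grows
print(crossing_alpha(toy_alphas, toy_q_attr, toy_q_net))
```

A finer grid of $\alpha$ values would locate the crossing more precisely; the choice remains a judgment call supported by the plot.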

For both methods, SUWAN and Subgroup Discovery (SD), a target variable is defined to obtain the class labels. For the EGR network, the focus is to form clusters of enterprises that have the same turnover class. The results on the Portuguese UCI networks show an average overall quality of 0.532 for SUWAN and 0.726 for SD (Table 5). This means that, on average, SD produced purer clusters than SUWAN. Concerning the number of clusters/subgroups, SD produced 3.8 subgroups on average, while SUWAN produced 3.3 clusters.

Table 5

Summary table of average performance of SUWAN and SD methods

Algorithm	Average #clusters/subgroups	Average overall quality
SUWAN	3.299	0.532
SD	3.761	0.726

Although the SUWAN algorithm produced, on average, lower quality clusters, more than 11% of the networks (8 of 67) achieved an overall quality higher than 70%. The results for these eight networks (38, 25, 48, 51, 32, 21, 41, and 50) are presented in Table 6 below, and a graphical representation of the networks is shown in Fig. 4.

Table 6

SUWAN results on EGR networks with a Portuguese UCI

Network ID	#Clusters	#Nodes	Overall quality
38	4	24	1
25	5	39	0.923
48	3	21	0.905
51	5	23	0.870
32	4	23	0.783
21	4	33	0.758
41	4	24	0.750
50	4	32	0.719

Figure 3.

Normalized proportion of explained pseudo-inertia for the distances between nodes attributes ( $D_{1}$ ) and the distances between nodes ( $D_{2}$ ), with ClustGeo package from R. Source: the author.

Figure 4.

Graphical representation of SUWAN on Networks 38, 25, 48, 51, 32, 41, 50 and 21, with cluster identification. Source: the author.

Network number 38, composed of 24 LEUs, achieved the maximum overall quality of 1, meaning that all the clusters obtained are class-uniform. In this case, the network presents only four class labels, which correspond to the number of levels taken by the target variable turnover class and to the number of clusters.

Table 7

Attribute list of nodes from Network_ID 38

Cluster	Type	Form	Country code	Size class	Turnover class	NACE div
1	L	LL	ES	1	4	C.10
	L	LL	PT	4	4	A.01
2	L	LL	PT	4	3	A.01
3	L	LL	PT	1	1	K.64
	L	LL	PT	1	1	K.64
	L	LL	PT	1	1	K.64
	L	LL	PT	1	1	A.01
	L	LL	PT	1	1	A.01
	L	LL	PT	1	1	K.64
4	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	A.01
	L	LL	PT	2	2	M.69
	L	LL	PT	1	2	A.01
	L	LL	PT	1	2	K.64

Table 7 also shows that all observations in this network share the same type of Legal Unit (L) and the Limited Liability form (LL). All LEUs are located in Portugal (PT), with the exception of one, which is located in Spain (ES). Cluster 1 is composed of two LEUs that belong to turnover class 4, which ranges between 10 and 50 million euros. Cluster 2 corresponds to turnover class 3, with a range between 2 and 10 million euros. Cluster 3 corresponds to turnover class 1, that is, a turnover of zero. This cluster is formed by LEUs with the lowest size class, most of which have an economic activity related to financial service activities (K.64). Lastly, cluster 4 corresponds to turnover class 2, defined as less than 2 million euros of turnover. This could be explained by the fact that most LEUs in this cluster dedicate their economic activity to agriculture and have the lowest size class, corresponding to 0–1 persons employed.
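The overall quality of 1 for Network 38 can be checked with a short purity computation. The sketch assumes that overall quality is the share of nodes carrying the majority turnover class of their cluster; this interpretation is consistent with the value reported for this network but is not a definition quoted from the source. The function name is hypothetical, and the class lists are transcribed from Table 7.

```python
from collections import Counter


def overall_purity(clusters):
    """Share of nodes whose class equals the majority class of their cluster."""
    total = sum(len(classes) for classes in clusters.values())
    majority = sum(Counter(classes).most_common(1)[0][1]
                   for classes in clusters.values())
    return majority / total


# Turnover classes per cluster for Network 38, as listed in Table 7.
network_38 = {
    1: [4, 4],
    2: [3],
    3: [1] * 6,
    4: [2] * 15,
}
print(sum(len(c) for c in network_38.values()))  # number of LEUs
print(overall_purity(network_38))
```

Under the same assumption, any cluster that mixes turnover classes lowers the score in proportion to its number of minority nodes.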

For network number 25, the SUWAN algorithm grouped the 39 nodes into five clusters, with an overall quality of 0.923. The first cluster has a majority turnover class of 1, with 12 cases. This cluster contains LEUs of type L and form LL. Apart from one case in Morocco, all LEUs in this cluster are in Portugal. The size class of the cluster is spread over classes 1 and 2, with one observation belonging to class 3, and the economic activities vary between agriculture and financial service activities. Clusters 2 and 5, with 7 and 14 nodes, respectively, are pure clusters with turnover class 2. These clusters only contain LEUs with type L, form LL, country PT, size class 2 and NACE div A.01. Cluster 3 contains a single node, which corresponds to the only observation with a turnover class of 4. Finally, cluster 4, with 5 nodes, is also a pure cluster, of turnover class 3. This cluster contains LEUs of types L and B, form LL, located in Portugal and Spain, with size classes 3 and 4 and economic activities of agriculture and manufacture of food products.

Network number 48 resulted in 3 clusters, where cluster 1 contains turnover classes 1 and 2, while clusters 2 and 3 are pure, with turnover classes 2 and 4, respectively. In network number 51, most clusters are also class-uniform, except for cluster 4, which contains turnover classes 1, 5 and 99.

We used Subgroup Discovery (SD) as a benchmark for SUWAN. The results obtained with the SD method for the same set of networks are presented in Table 8. Several differences stand out, such as the number of nodes used in each network, which differs from the number used by SUWAN. The reason is that SD does not group all nodes into clusters; instead, it finds subgroups of nodes with an associated description. In addition, the algorithm allows nodes to overlap between subgroups, which affects the number of unique nodes used in the SD task.

Table 8

Subgroup discovery results on EGR networks with a Portuguese UCI

Network ID	#Subgroups	#Nodes	#Unique nodes	Overall quality
38	4	24	5	0.800
25	4	39	11	0.688
48	3	21	20	0.531
51	3	23	21	0.303
32	4	23	22	0.525
21	4	33	9	0.800
41	4	24	7	0.750
50	1	32	29	0.655

For network number 38, SD generated 4 subgroups in which only 5 of the 24 existing nodes were grouped. In this specific case, the same five nodes were grouped in four different subgroups, with different descriptions. In fact, the subgroups found are subgroups of each other, with more specific descriptions for the same set of nodes (Table 9). The overall quality of this network with subgroup discovery is 80%, since there is only one node with a different class label within each subgroup.

Table 9

Subgroup discovery output for Network_ID 38

Subgroup	Nodes_ID	Target class	Description
1	1	1	NACE Div $=$ K.64
	2	2
	3	1
	4	1
	5	1
2	1	1	NACE Div $=$ K.64 $+$ Country Cod $=$ PT
	2	2
	3	1
	4	1
	5	1
3	1	1	NACE Div $=$ K.64 $+$ Country Cod $=$ PT $+$ Size Class $=$ 1
	2	2
	3	1
	4	1
	5	1
4	1	1	NACE Div $=$ K.64 $+$ Size Class $=$ 1
	2	2
	3	1
	4	1
	5	1
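The effect of overlapping subgroups on the node counts in Table 8 can be illustrated with the data of Table 9. The sketch counts the unique nodes covered and averages the per-subgroup share of the majority target class; treating that average as the overall quality is an assumption that happens to reproduce the reported 0.800. Variable and function names are hypothetical.

```python
from collections import Counter

# Node ID -> target class for each subgroup, transcribed from Table 9.
same_nodes = {1: 1, 2: 2, 3: 1, 4: 1, 5: 1}
subgroups_38 = {sg: dict(same_nodes) for sg in (1, 2, 3, 4)}


def unique_nodes(subgroups):
    """Nodes that appear in at least one subgroup (overlaps counted once)."""
    covered = set()
    for members in subgroups.values():
        covered.update(members)
    return covered


def mean_subgroup_purity(subgroups):
    """Average share of the majority target class over all subgroups."""
    shares = []
    for members in subgroups.values():
        counts = Counter(members.values())
        shares.append(counts.most_common(1)[0][1] / len(members))
    return sum(shares) / len(shares)


print(len(unique_nodes(subgroups_38)))
print(mean_subgroup_purity(subgroups_38))
```

Because only 5 of the 24 nodes are covered, the score describes a small part of the network, which should be kept in mind when comparing it with a method that assigns every node.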

For the other networks presented in Table 8, subgroup discovery produced different outcomes from the one described above. In the case of network number 50, it produced a single subgroup with 29 observations, described by the economic activity of the LEUs, which corresponds to the Human and Health activities. The target class of this subgroup takes the values 2, 3, 4 and 99, with the majority of observations belonging to class 2.

Figure 5.

Graphical representation of Subgroup Discovery on network 48, with colour identification of the subgroups. Source: the author.

For network 48 (Fig. 5), SD grouped 20 of the nodes, with an overlap of 37.5%. The subgroups are described by the size class and the country code. The first subgroup, of size 6, contains observations with turnover classes 1 and 2, and is described by a size class of 2. The second subgroup is contained in the first one, with the country code (PT) added to the description. The last subgroup, with 20 nodes, is described by the country code (PT) alone; in this case, the target class takes the values 1, 2 and 4. Because nodes overlap between subgroups, Figure 5 shows the three subgroups separately on the same network.

The two methodologies differ in the time needed to learn the clusters: SUWAN took more computational time than SD. The algorithm implemented in SUWAN searches the solution space for a solution that maximizes the quality function. It does so through an iterative process in which single non-representatives are added to the current solution and single representatives are removed from it, and the duration of this process depends on the size of the network. As a result, the computational time of SUWAN is strongly related to the size of the network, whereas that of SD was not affected by it.
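The iterative search described above can be sketched as a greedy hill climb over sets of representatives. The source does not list the implementation, so everything here is an assumption in the spirit of SRIDHCR: the score is class impurity plus `beta` times a penalty on extra clusters, the distance matrix is assumed to be already mixed with $\alpha$, and all names and the toy data are hypothetical. Minimizing this score plays the role of maximizing the quality function.

```python
import math
import random


def score(reps, dist, labels, beta=0.1):
    """Impurity plus beta * penalty for using more clusters than classes."""
    n = len(labels)
    owner = [min(reps, key=lambda r: dist[i][r]) for i in range(n)]
    minority = 0
    for rep in reps:
        members = [labels[i] for i in range(n) if owner[i] == rep]
        if members:
            minority += len(members) - max(members.count(c) for c in set(members))
    k, n_classes = len(reps), len(set(labels))
    penalty = math.sqrt((k - n_classes) / n) if k > n_classes else 0.0
    return minority / n + beta * penalty


def hill_climb(dist, labels, k_init, beta=0.1, seed=0):
    """Add or remove one representative at a time while the score improves."""
    n = len(labels)
    reps = random.Random(seed).sample(range(n), k_init)
    best = score(reps, dist, labels, beta)
    while True:
        candidates = [reps + [i] for i in range(n) if i not in reps]
        if len(reps) > 1:
            candidates += [[r for r in reps if r != x] for x in reps]
        if not candidates:
            return reps, best
        scored = [(score(c, dist, labels, beta), c) for c in candidates]
        q, cand = min(scored, key=lambda t: t[0])
        if q >= best:          # no neighbouring solution is better: stop
            return reps, best
        reps, best = cand, q


# Toy data (hypothetical): two well-separated groups with two classes.
toy_positions = [0, 1, 2, 10, 11, 12]
toy_labels = ["a", "a", "a", "b", "b", "b"]
toy_dist = [[abs(p - q) for q in toy_positions] for p in toy_positions]
print(hill_climb(toy_dist, toy_labels, k_init=2))
```

Each iteration scores every one-step neighbour of the current solution, and each score requires reassigning all nodes, which is one way to see why the running time of such a search grows quickly with network size.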

4.1 Variables impact on performance

As a complementary task, it was also possible to observe that the variables with the highest impact on the LEU's turnover class are the size class and the economic activity of the group (NACE Div). For the NACE Div attribute, it is possible to infer that, for turnovers of less than 2 million euros (classes 1 and 2), the most frequent activities are financial services (K.64), real estate (L.68) and professional, scientific and technical activities (M.70). On the other hand, for the higher turnover classes 4 and 5, with more than 10 million euros, the most frequent activities are in the fields of wholesale and retail trade and repair of motor vehicles and motorcycles (G.46, G.45 and G.47), electricity, gas, steam and air conditioning supply (D.35), manufacturing (C.10, C.16) and construction (F.42 and F.41). For the size class, the majority of LEUs with the lower turnover classes 1 and 2 are concentrated in the lower size classes 1 and 2. Conversely, LEUs with higher turnover classes have more observations in size classes 5 and 6.

Due to the selection of networks with a Portuguese UCI, the variables concerning the type and legal form of LEUs proved to be irrelevant for SUWAN, since there are few cases where the attributes type and form differ from L and LL, respectively.

4.2 Network topology impact on performance

In order to analyze the impact of the network's topology on the group's performance, some network topology measures and the group total turnover were used. The measures selected describe the compactness, centrality and density of the networks (Table 10).

Table 10

Summary of the essential network topology measures used to study the impact on organizational performance

Measure	Description
Diameter	Measures how compact the network is
Density	Measures the connectivity of the network
Average Degree Centrality	Measures the connectivity of nodes, on average
Average Betweenness Centrality	Measures the capacity of information flow between nodes, on average
Average Closeness Centrality	Measures the influence of nodes in the entire network, on average
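As a minimal sketch of how some of the measures in Table 10 can be obtained for a small connected, undirected network, the code below computes the diameter, density, average degree and average closeness centrality with breadth-first search. It is not the authors' code: the function names and the example edge list are hypothetical, closeness is taken as $(n-1)$ divided by the sum of shortest-path distances, and betweenness is omitted for brevity.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def topology_summary(edges):
    """Diameter, density, average degree and average closeness.

    Assumes a connected, undirected graph without self-loops and n >= 2.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    m = sum(len(neigh) for neigh in adj.values()) // 2
    all_dist = [bfs_distances(adj, u) for u in adj]
    closeness = [(n - 1) / sum(d.values()) for d in all_dist]
    return {
        "diameter": max(max(d.values()) for d in all_dist),
        "density": 2 * m / (n * (n - 1)),
        "avg_degree": 2 * m / n,
        "avg_closeness": sum(closeness) / n,
    }


# Hypothetical example: a chain of four units, 0 - 1 - 2 - 3.
chain = [(0, 1), (1, 2), (2, 3)]
print(topology_summary(chain))
```

For networks of the size considered here (tens of nodes), running one breadth-first search per node is inexpensive.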

Figure 6.

Correlation plot between the variables diameter, average degree, average closeness, average betweenness, density and total turnover. Source: the author.

The group performance can be measured by its turnover. Hence, a new variable indicating the total turnover of the group was computed, named sum_ent_turnov. The total turnover was obtained as the sum of the turnover of each enterprise that belongs to the group. The correlations in Fig. 6 show that the pairs of variables average closeness/density and diameter/average betweenness are positively correlated, with correlation values of 0.93 and 0.88, respectively. On the other hand, the pairs average degree/average closeness and diameter/average closeness are negatively correlated, with values of $-0.93$ and $-0.7$, respectively. Although none of the variables seems to be correlated with the total turnover of the group, a multiple regression model was fitted to the 67 networks under analysis, with the total turnover as the dependent variable and the network topology measures as independent variables.

The multiple linear regression output in Table 11 shows that the variable with the largest estimated effect on the group's turnover is the average closeness, with an estimate of 7.524e11, although its $p$-value is not significant. In the same way, none of the variables proved to be significant in the model. Given the F-statistic and the overall $p$-value (0.2294), the null hypothesis of no relationship between the dependent variable and the independent variables cannot be rejected. Therefore, there is no significant evidence of a relationship between the total turnover variable and the network topology variables.

Table 11

Multiple linear regression model with network topology measures as independent variables and total turnover as dependent variable

Residuals	Min	1Q	Median	3Q	Max
	-2.126e11	-3.546e10	-8.559e9	8.225e9	1.322e12
Coefficients		Estimate	Std. Error	$t$ value	Pr($>|t|$)
	(Intercept)	5.302e11	4.814e12	0.110	0.913
	Diameter	1.188e10	3.256e10	0.365	0.717
	Aver_degree	-3.066e11	2.404e12	-0.128	0.899
	Aver_closeness	7.524e11	1.137e13	0.066	0.947
	Aver_betweenness	1.449e10	1.712e10	0.846	0.401
	Density	NA	NA	NA	NA
Residual standard error	1.767e11 on 62 degrees of freedom
Multiple R-squared	0.08537
Adjusted R-squared	0.02636
F-statistic	1.447 on 4 and 62 DF, $p$-value: 0.2294
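The summary statistics in Table 11 are mutually consistent, which can be checked with the standard formulas for the adjusted R-squared and the overall F-statistic. The sketch uses only values reported in the table (67 networks, 4 estimated slopes, multiple R-squared of 0.08537); the function names are hypothetical.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / n_predictors) / ((1 - r2) / (n_obs - n_predictors - 1))


# Values reported in Table 11; Density is not estimated, leaving 4 slopes.
R2, N_OBS, N_PRED = 0.08537, 67, 4
print(N_OBS - N_PRED - 1)                                 # residual degrees of freedom
print(round(adjusted_r_squared(R2, N_OBS, N_PRED), 5))
print(round(f_statistic(R2, N_OBS, N_PRED), 3))
```

The small adjusted R-squared indicates that, after accounting for the number of predictors, the topology measures explain very little of the variation in total turnover.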

5. Conclusions and challenges

The approach of using both the information about the network structure and the attributes of the nodes in the clustering process proved to be feasible. It enabled the creation of clusters/subgroups that are not only densely linked, but also class-uniform in terms of the target class that describes those vertices. A characteristic of interest is defined beforehand as a target class, which makes it possible to obtain clusters/subgroups based on a class label.

The application of supervised clustering on attributed networks seems to be a promising technique. Atzmueller [30] applied the alternative Subgroup Discovery methodology to attributed social networks. For this purpose, the principles of Subgroup Discovery were adapted to the dyadic network setting, detecting compositional patterns and capturing subgroups of nodes, evaluated by a quality measure. Subgroup discovery was implemented on the EGR networks with the SD-MAP algorithm, using the preprocessing of the COMODO algorithm, which combines the graph structure and the descriptive information of the vertices.

For the supervised clustering approach, the SRIDHCR algorithm proposed by Zeidat and Eick [9] was adapted to consider both structural and compositional characteristics of the EGR network. Moreover, the original algorithm was also adapted to categorical variables through a variation of k-means known as k-modes.

In a preliminary analysis of the outputs produced by both methodologies, Subgroup Discovery produced clusters/subgroups with higher overall quality than SUWAN. However, Subgroup Discovery achieved these results in part because it groups only a fraction of the nodes and allows nodes to overlap between subgroups. Its main focus is to find subgroups of nodes that are described by patterns and meet a given quality measure.

The SUWAN method also produced good results, with high cluster purity in several of the studied cases. In contrast to Subgroup Discovery (SD), this method groups all nodes of the network into clusters.

Regarding the computational time of the presented methodologies, SUWAN proved to be less efficient in reaching a solution. This is due to the iterative process that the SUWAN algorithm uses when searching the solution space, which is highly influenced by the size of the network. SD, on the other hand, was not affected by the size of the network.

The focus of the work was to obtain class-uniform clusters, based on the turnover class of the EuroGroups LEUs, using SUWAN. The analysis of the results revealed certain patterns in the nodes that compose the clusters. Clusters whose majority turnover class is 1 or 2 are formed by LEUs that employ fewer persons. Similarly, clusters with turnover class labels of 5 or 6 are formed by legal units with higher size classes. Therefore, the turnover class is clearly associated with the size of the legal unit.

Additionally, the analysis of the impact of network topology on performance showed no significant evidence of a relationship between the total turnover variable and the network topology measures of diameter, average degree, closeness and betweenness.

Furthermore, this study revealed that SUWAN involves certain challenges when applied to attributed networks. One of the challenges is the parameterization of the variables that influence the clustering output. For example, the importance of the network topology, established by $\alpha$, has a strong influence and can lead to several different outcomes. Also, the developed methodology relies on representative-based supervised clustering, which randomly chooses the initial set of $k$ representatives. Although this process allows the solution space to be explored, the clustering result is still affected by this randomness. Another challenge is the evaluation method. Evaluating the quality of a clustering is challenging, as the correct clusters are not known. Also, the proposed evaluation method focuses more on the nodes' attributes, since it evaluates the overall quality based on cluster purity. This may result in circumstances where certain algorithms perform better in terms of network topology but worse in terms of node characteristics, making it difficult to determine which method performs better overall.

As future developments, we aim to develop new evaluation methods based on a better trade-off between network topology and node characteristics. In addition, since this paper dealt with undirected networks, it is also important to develop measures for directed networks.

References

1. Ricciardi and Rossignoli, Inter-Organizational Relationships, Towards a Dynamic Model for Understanding Business Network Performance, Springer International Publishing, 2015.
2. Ebers, The formation of inter-organizational networks, Oxford University Press, 1999.
3. Oliveira and Gama, An overview of social network analysis, WIREs Data Mining and Knowledge Discovery 2 (2012), 99–115. doi: 10.1002/widm.1048.
4. Harenberg, Bello, Gjeltema, Ranshous, Harlalka, Seay, Padmanabhan and Samatova, Community detection in large-scale networks: A survey and empirical evaluation, Wiley Interdisciplinary Reviews: Computational Statistics 6(6) (2014), 426–439. doi: 10.1002/wics.1319.
5. Newman M.E.J., Modularity and community structure in networks, in: Proceedings of the National Academy of Sciences 103(23) (2006), 8577–8582. doi: 10.1073/pnas.0601602103.
6. Jain A.K., Data clustering: 50 years beyond k-means, Pattern Recognition Letters 31(8) (2010), 651–666. doi: 10.1016/j.patrec.2009.09.011.
7. Hoberecht, Joseph, Spencer and Southern, Inter-organizational networks: An emerging paradigm of whole systems change, Journal of the Organization Development Network 43(4) (2011), 23–27.
8. Matous and Todo, Analyzing the coevolution of interorganizational networks and organizational performance: Automakers’ production networks in Japan, Applied Network Science 2(1) (2017), 5. doi: 10.1007/s41109-017-0024-5.
9. Zeidat and Eick C.F., K-medoid-style Clustering Algorithms for Supervised Summary Generation, in: Proceedings of the International Conference on Artificial Intelligence, 2004.
10. Helal, Subgroup discovery algorithms: A survey and empirical evaluation, Journal of Computer Science and Technology 31 (2016), 561–576. doi: 10.1007/s11390-016-1647-1.
11. Al-Harbi S.H. and Rayward-Smith V.J., Adapting k-means for supervised clustering, Applied Intelligence 24(3) (2006), 219–226. doi: 10.1007/s10489-006-8513-8.
12. INE, Statistics Portugal EuroGroups Data, available non-published data, 2020.
13. Sinaga K.P. and Yang, Unsupervised K-Means Clustering Algorithm, in: IEEE Access 8 (2020), 80716–80727. doi: 10.1109/ACCESS.2020.2988796.
14. Jain A.K., Murty M.N. and Flynn P.J., Data clustering: A review, ACM Computing Surveys 31(3) (1999), 264–323. doi: 10.1145/331499.331504.
15. Hewapathirana I.U., Change detection in dynamic attributed networks, WIREs Data Mining and Knowledge Discovery 9(3) (2019). doi: 10.1002/widm.1286.
16. Vieira A.R., Campos and Brito, New contributions for the comparison of community detection algorithms in attributed networks, Journal of Complex Networks 8(4) (2020). doi: 10.1093/comnet/cnaa044.
17. Eick C.F., Zeidat and Zhao, Supervised clustering – algorithms and benefits, in: 16th IEEE International Conference on Tools with Artificial Intelligence, 2004, pp. 774–776. doi: 10.1109/ICTAI.2004.111.
18. Pandit and Gupta, A comparative study on distance measuring approaches for clustering, International Journal of Research in Computer Science 2(1) (2011), 29–31.
19. Kaufman and Rousseeuw P.J., Clustering by means of medoids, in: Proceedings of the Statistical Data Analysis Based on the L1 Norm Conference, Neuchatel, Switzerland, 1987, pp. 405–416.
20. Gan, Huang, Luo and Gao, On using supervised clustering analysis to improve classification performance, Information Sciences 454-455 (2018), 216–228. doi: 10.1016/j.ins.2018.04.080.
21. Finley and Joachims, Supervised k-Means Clustering, Computing and Information Science Technical Reports, Cornell University Library, 2008.
22. Shi, Fang, Kong and Yin, Semi-supervised Co-Clustering on Attributed Heterogeneous Information Networks, Information Processing & Management 57(6) (2020), 102338. ISSN 0306-4573. doi: 10.1016/j.ipm.2020.102338.
23. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43(1) (1982), 59–69. doi: 10.1007/bf00337288. S2CID 206775459.
24. Atzmueller, Doerfel and Mitzlaff, Description-oriented community detection using exhaustive subgroup discovery, Information Sciences 329 (2016), 965–984. doi: 10.1016/j.ins.2015.05.008.
25. Herrera, Carmona C.J., González and Del Jesus M.J., An overview on subgroup discovery: Foundations and applications, Knowledge Information Systems 29(3) (2011), 495–525.
26. Atzmueller, Subgroup discovery, WIREs Data Mining Knowledge Discovery 5(1) (2015), 35–49. doi: 10.1002/widm.1144.
27. Atzmueller and Lemmerich, Fast Subgroup Discovery for Continuous Target Concepts, in: J. Rauch, Z.W. Raś, P. Berka, T. Elomaa (eds) Foundations of Intelligent Systems, ISMIS 2009, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, Vol. 5722, 2009. doi: 10.1007/978-3-642-04125-9_7.
28. Eurostat, Business Registers Recommendations Manual, Methodologies and Working papers, Publication Office of the European Union, Luxembourg, 2010.
29. Chavent, Kuentz-Simonet, Labenne and Saracco, ClustGeo: An R package for hierarchical clustering with spatial constraints, Computational Statistics 33(4) (2018), 1799–1822. doi: 10.1007/s00180-018-0791-1.
30. Atzmueller, Compositional Subgroup Discovery on Attributed Social Interaction Networks, in: L. Soldatova, J. Vanschoren, G. Papadopoulos, M. Ceci (eds) Discovery Science, Lecture Notes in Computer Science, Springer, Cham, Vol. 11198, 2018. doi: 10.1007/978-3-030-01771-2_17.
