Abstract
Social network analysis offers an understanding of our modern world, and it affords the ability to represent, analyze and even simulate complex structures. While an unweighted model can be used for online communities, trust or friendship networks should be analyzed with weighted models. To analyze social networks, it is essential to produce realistic social models. However, there are serious differences between social network models and real-life data in terms of their fundamental statistical parameters. In this paper, a genetic algorithm (GA)-based social network improvement method is proposed to produce social networks more similar to real-life data sets. First, it creates a social model based on existing studies in the literature, and then it improves the model with the proposed GA-based approach based on the similarity of the average degree, the k-nearest neighbor, the clustering coefficient, degree distribution and link overlap. This study can be used to model the structural and statistical properties of large-scale societies more realistically. The performance results show that our approach can reduce the dissimilarity between the created social networks and the real-life data sets in terms of their primary statistical properties. It has been shown that the proposed GA-based approach can be used effectively not only in unweighted networks but also in weighted networks.
Introduction
In recent years, social network modeling has become a powerful methodology for analyzing complex networks ranging from personal relationships to chemical interactions by applying data analysis, modeling, and simulation [1–3]. The most prominent application areas are epidemics [4], information flow [5], criminal network analysis [6], and explaining protein-protein interactions [7]. It can also be used as a tool for the analysis of friendships [8], hidden groups [9] or simulation of a social network [10]. While some unweighted social networks can model online communities [11], there are also weighted networks that can represent friendship and trust networks [8, 13]. Unfortunately, there are notable differences between real-life social network data sets and the existing social network models. In this paper, a novel genetic algorithm-based method is proposed to improve existing unweighted and weighted social network models so that they more closely represent real-life social networks.
Social network analysis was first used by social scientists to study relationships in small communities [1, 2]. As large data sets have become available through e-mails, LinkedIn, Twitter, Facebook, and so on, studies in the literature have focused on one-to-one interactions between people [14–16]. While some research concentrates on online social communities (LinkedIn, Twitter, Facebook, etc.), there are also several studies aiming to understand the dynamics of real-life social networks, such as friendship and trust [8, 18].
Crucial questions about network formation among individuals are being addressed by the recent and growing literature [19], for example, (i) Can someone predict which networks are likely to form when individuals have the judgment to choose their connections? (ii) How are network relationships important in determining the outcome of an interaction? and (iii) How efficient are the networks that form? Social networks are modeled as graphs, which can represent social structures, clusters and effects on interpersonal relationships. People are the nodes, and their relationships are the edges in graph-based modeling. The graphs that model social networks are generated by the evolution process, which stems from social theories such as homophily and Milgram’s experiment [17, 20].
In many studies, the edge weights are not taken into account; these models are referred to as unweighted social network models in the literature [11, 16]. Although unweighted social network models can be a useful tool for some specific problems, such as online communities, weighting the edges to represent the strength of relationships in a weighted social network model is a better way to model real-life social networks [14, 22]. One of the oldest studies in weighted network modeling is based on the strength of weak ties hypothesis [1]. In this work, the degree of overlap between two individuals is studied, and community topology and influence are analyzed. In [8], the importance of the edge weights is evaluated to search a person in an online community.
It must be emphasized that there may be serious differences between social network models and real-life data in terms of their fundamental statistical parameters, such as with the clustering coefficient of existing models and that of networks constructed with real data [14]. In this paper, a GA-based improvement model is proposed to produce more realistic weighted and unweighted social networks in terms of the basic network graph features. The input of this algorithm is a network generated from an existing social network model and some statistical parameters obtained from a real data set [22], and the output is a much more realistic weighted social network. Our solution approach relies on a unified fitness function composed of five different components: (i) average degree, (ii) clustering coefficient, (iii) degree distribution, (iv) k-nearest neighbor, and (v) link overlap. We propose two specialized crossover operators, which are called the degree-based crossover (DBX) and the node-based crossover (NBX, and Add, Delete, and Swap mutation operators. The DBX operator exchanges the friends of nodes that have the same degrees, while the NBX operator exchanges the friends of nodes for a given range of nodes. The initial population is produced based on an existing social network model [11, 21]. To demonstrate the effectiveness of our GA-based approach, an experimental study of various design parameters was performed using one of the most well known sets of weighted social network data. The performance results clearly show that our GA-based approach can dramatically improve the existing social network models.
The remainder of this article is organized as follows. Section 2 discusses social network modeling and the existing models. Section 3 includes a description of our novel GA-based social network model improvement model. Section 4 describes the experimental study and computational results of the comparisons. Conclusions are drawn in Section 5.
Social network modeling
There have been many models and link prediction methods proposed in the literature [12]. In particular, the evolution-based approach is one of the most prominent methods in social network modeling. It is based on testing hypotheses on the evolution rules for a specific network structure.
The Erdös and Rényi (random graph) [23], Watts and Strogatz (rewired ring lattice) [18], and Kleinberg (a navigable small world) [24] models are some of the best-known evolution-based techniques. Erdös and Rényi’s model is based on bounding random individuals, and this model shows a high similarity to the binomial distribution for degrees. In the 1990s, Duncan Watts and Steve Strogatz proposed a new random graph model for real-life social networks to improve the properties of small diameter and high clustering coefficients [18]. The most significant finding in this model is focused on networks of high clustering coefficients and low diameters. A recent major development in the modeling of social networks emerged with the work of Kleinberg, which shows that the Watts-Strogatz model cannot replicate the Milgram’s experiment feature of the short path [5]. Therefore, a local greedy search information algorithm is developed to construct shorter paths to satisfy the small-world phenomena. There are also link prediction approaches such as the Jaccard coefficient and the Adamic-Adar [25], preferential attachment and ensemble methods. These methods are designed to be used on existing networks, but they can be used or altered to create a new network as well. Another approach is based on exponentially damped path counts (Katz) and calculates the probability of one person introducing two people [5]. This approach indirectly satisfies Milgram’s experiment to the largest degree, as it can create short routines between nodes and increase the chance of connecting nodes. However, it can be biased and can ignore random shortcuts. These models can create random social networks, but they cannot represent all the features of these networks because they are only based on Milgram’s experiment. Social network models are categorized in the literature as (i) unweighted and (ii) weighted network models.
Unweighted social network modeling
In the literature, two evolution modeling techniques have been proposed to produce unweighted social network models: network structure-based models (NEMs) and nodal attribute models (NAMs) [11]. NEMs rely on explicit stochastic rules that consider node and link selection for adding and deleting. NAMs distribute nodes in n-dimensional space and ties nodes according to a probability distribution function that depends on a predefined distance metric.
Davidsen et al. introduced a dynamical network evolution model that aims to improve the small-world features of triadic closure and global connections [26]. It connects two neighbors of a randomly selected node and deletes a random node with a particular probability p. Another dynamical NEM based on the friend’s friend concept has been proposed by Marsili et al. [27]. This model is based on three parameters: random node connection, friend of friend connection and random deletion probabilities. This model randomly selects a node and connects it to an arbitrary node, connects it to the friend of a friend based on its probabilities, or deletes all of its friends. The model is based on triadic closure with a two-step search in the neighborhood of a node, as in [11, 27]. It requires more parameters, unlike Davidsen’s model, and it increases the complexity of parameter selection.
While dynamical NEMs introduce all nodes initially, there is another approach, a growing NEM that starts with a small number of nodes and gradually adds new nodes. Toivonen et al. introduce a growing NEM that ties secondary connections based on a uniform distribution [21]. In this model, each new node connects to a neighbor of its initial contacts as a growing NEM. Another growing NEM adds a new node to the network with probability p and ties common neighbors of a selected node with (1 - p) [28]. In this study, a linear preferential attachment is used that has a tendency to connect high degree nodes as a growing NEM.
NAMs differ from NEMs in that they generate the edges according to a probability function based on the attributes of the nodes. They consider homophily rather than the general structure of the network. Boguna et al introduce a new NAM that distributes the nodes on a uniform distribution in one-dimensional social space and generates edges between nearby nodes [29]. Similarly, Wong et al propose an NAM using the homogeneous Poisson distribution in two-dimensional social spaces [30].
Weighted social network modeling
In social network evolution modeling, there are a few studies on concentrating the strength of ties and the asymmetry of influence as a type of weighted social network modeling. This modeling approach assumes edge weights between two nodes as real numbers rather than as binary [11, 22]. The first example of a weighted social network model is by Kumpula [14]. This study is based on the connection of strong friendships (local search), random friendships (global search), random delete and revisiting links. Global and local search processes connect two nodes with a unit weight in a weighted social network model. The revisiting link process increases the strength of an existing edge between two nodes with a reinforcement parameter, and the delete process arbitrarily removes the connections of a node. This model shows some of the topological and statistical properties of real-life social networks, and it supports the strength of weak ties hypothesis of Granovetter [1]. However, it does not preserve other features such as the k-nearest neighbor, degree distribution and clustering coefficient when it is compared using Finnish phone data.
A general review of trust transition and opinion dynamics is presented by Ureña et al [31]. This paper also provides influence assessment, as part of social networks and addresses challenges on leveraging the trust-based network as an influence measure to cultivate decision making processes in complex social networks scenarios with uncertain knowledge. Another study models trust as edge weights between nodes [3]. This model simulates the trust interactions between nodes and updates the edge weights. It aims to explain and mimic the causal processes underlying Dunbar’s social brain hypothesis. Even though this model was robust against uncooperative behavior, it becomes complicated for large-scale social networks. Detecting communities in dynamic evolving networks can be accomplished by adding one node or one edge. In fact, in dynamically developing networks, new data is continually generated with subgraphs combining simultaneously. Recently, an incremental method to detect communities by handling subgraphs is presented [32].
Yuan et al verified that the trust network is a small-world network, and its topology is independent of its dynamics [33]. In [34], the dynamics of emotional communications in social networks are modeled with an agent-based simulation. Li et al addressed the signed network community detection problem from the viewpoint of evolutionary computation [35]. They proposed a multi objective optimization model based on link density and a multi objective particle swarm optimization algorithm to solve this model. The results show aggregated group behaviors for individual emotional actions, dominant emotions, and flow patterns for different emotions.
Weighted and unweighted network modeling is compared in [8]. In this paper, a new social search strategy is introduced by taking into consideration the influence of one person over another, which is an inherently asymmetric relationship and of varying strength. It concludes that weighted social network modeling can be greatly beneficial in social search.
In [22], a network of mobile phone call records is constructed, and aggregated call duration and the cumulative number of calls are used as a measure of the strength of a social tie. The data set has 7.2 × 106 nodes and 22.6 × 106 links with an average degree of 3.13, and it is one of the best-known large-scale weighted social network data sets.
The proposed genetic algorithm-based improvement model
A social network can be characterized by a set of features, such as k-nearest neighbor, clustering coefficient, degree distribution, and link overlap, and it has been shown that there may be serious differences between the features of real-life social networks and social network models [14]. However, finding the optimal solution to this problem is a computationally intensive process, and generally, it has an exponentially growing complexity on the dimension of the decision space. In this paper, instead of searching for the optimal solution, a GA-based framework is developed to produce social networks more similar to real-life data sets. Our GA-based approach creates a social network model based on an existing model in the literature, and then it improves the model considering the similarity of average degree, the k-nearest neighbor, the clustering coefficient, degree distribution, and link overlap. The resulting social networks can be used to modeling the structural and statistical properties of large-scale societies more realistically. Evolutionary algorithms are a broader class of meta-heuristics that mimic the analogy of natural evolution and are applied in many engineering and operations research problems. The GA is the most widely known type of evolutionary algorithm and was developed by John Holland in 1975 [36]. A GA generates a random population of specific size among possible solutions. Given an objective function, the fitness of the individuals is evaluated, and a pair of candidate solutions is selected for mating. Genetic crossover and mutation operators are applied to the selected parents to generate new candidate solutions that compete with the old ones to replace them in the next generation. This process continues until the specified termination condition is satisfied.

A sample of a weighted social network: (a) A graph and (b) Its string representation.
In our GA-based approach, a weighted social network model instance contains the links and their weight values, as shown in Fig. 1. It should be noted that if all the weights are equal to 1, the same representation could also be used for unweighted social networks. Possible solutions to a given problem may have a different number of links for each node. Therefore, string representation in our GA-based approach supports vertically variable-length chromosomes. Additionally, there is no order of nodes in a given solution. In this study, a generational GA is applied to generate an entirely new population in each iteration. We consider the tournament selection mechanism and the tournament size to be fixed in the experiments.
In our GA-based approach, the fitness function represents the dissimilarity between the generated social network and the real-life data in terms of some of the most important statistical network features: the average degree, the clustering coefficient, the degree distributions, the k-nearest neighbor and link overlap. The error (dissimilarity) between a given social network and a real-life data set can be calculated as in Equation (1).
A predefined number of individuals are generated at the initial population generation step. Two different strategies are followed in this step. The first is to use Kumpula’s model for 100% of the initial population. In this strategy, Kumpula’s model parameters are optimized using Toivonen’s dissimilarity objective function. In the second strategy, 50% of the individuals are created using Kumpula’s model as before, and the remaining 50% of the initial population is produced by NBX One-Bit.
Our solution approach comprises a set of variation operators. Crossover operators consider problem-specific information to carry the nodes and their links from the parents to the offspring. In our study, we consider two different crossover operators: nodal-based and degree-based crossovers. Both crossover operators are applied to two parents selected by the tournament selection method and generate two offspring as the output. Crossover operators: (a) NBX One-Bit, (b) NBX Two-Bit, (c) DBX One-Bit, and (d) DBX Two-Bit. NBX Operator: This operator is similar to the original single-point crossover operator proposed in the literature [37]. Two variants of this operator are developed. In the first variant, NBX One-Bit, one crossover point is selected randomly from the range [1, ⋯ , (n
max
- 1)], where n
max
is the maximum number of nodes in the parent solution. Then, both parents are separated at the given crossover point, and two children are created by exchanging tails as in Fig. 2a. In the second variant, NBX Two-Bit, two crossover points k1 and k2 are selected randomly by considering 1 ≤ k1 ≤ k2 ≤ (n
max
- 1). To generate two offspring, the genes between k1 and k2 are exchanged, as shown in Fig. 2. DBX Operator: This operator also has two variants. In the first variant, DBX One-Bit, a degree (d) is selected randomly from the range [1, ⋯ , (d
max
- 1)], where d
max
is the maximum degree in the parent solution. Both parents are searched until the first node whose degree equals d is found. Then, both nodes are exchanged from the crossover points, and two children are created (see Fig. 2). In the second variant, DBX Two-Bit, two crossover degrees d1 and d2 are selected randomly by considering 1 ≤ d1 ≤ d2 ≤ (d
max
- 1). Starting from the first node, this operator searches in Offspring-1 for a node with degree d (d1 ≤ d ≤ d2) and searches in Offspring-2 for a node with degree k (d1 ≤ k ≤ d2). When it finds such node, they exchange their friend lists. This search-find-exchange operation is performed until one of the parents is reached. In Fig. 2d, an example of a DBX Two-Bit operation is shown for d1 = 3 and d2 = 5.
NBX and DBX generate some new links and delete some existing links. Because of the undirected structure of the network, when a new link is created between two nodes, node a and node b, the gene of node a must contain node b and vice versa. In the same way, when an existing link is deleted between two nodes, node b must be removed from the gene of node a and vice versa. After an NBX and DBX operation, a repair procedure must be performed to satisfy this requirement. For example, in Fig. 2c, the friends of node 1 in Offspring-1 are copied from node 2 of Parent-2. Since node 2 is not connected to node 1 in Offspring-1, node 1 is deleted from the friend list of node 2. After a crossover or mutation operation, if a node’s friend list contains itself, as in node 2 in Fig. 2d, it is also deleted. By contrast, in Fig. 2c, node 1 is linked to node 3 in Offspring-1. Therefore, the friend list of node 3 must contain node 1.

Mutation operators: (a) Add, (b) Delete, and (c) Swap.
Mutation operators, shown in Fig. 3, are Add, Delete, and Swap mutation methods. In the Add mutation, one random friend is added to the friend list of the nodes with a certain degree. In Fig. 3, a random friend is added to the nodes with five degrees. In the Delete mutation, one random friend is deleted from the friend list of the nodes with a certain degree. In Fig. 3, a Delete mutation for the nodes with five degrees is shown. The Swap mutation sets two different degree values randomly (d1, d2) and starts a search for the nodes with degrees d1 nad d2. When it finds two nodes, one with a d1 degree and the other with a d2 degree, it swaps their friend lists. An example Swap mutation operation for d1 = 4 and d2 = 5 is shown in Fig. 3c. It should be noted that as with the NBX and DBX operators, the necessary repair functions must be performed after the mutation operations to maintain the undirected structure of the network. For example, in Fig. 3a, after the addition of node 4 to the friend list of node 1, node 1 must be added to the friend list of node 4. In the same way, in Fig. 3b, after the deletion of node 2 from the friend list of node 1, node 1 must be deleted from the friend list of node 2.
This section presents the performance results of our GA-based framework for weighted and unweighted social network modeling. To compare the solution quality of the algorithms, the results are reported in terms of the dissimilarity factor defined in Equation (3). The genetic algorithm is coded in MATLAB. Although the computation run times of the network-related procedures are polynomial, because of the large population of the individuals in the GA (106), the experiments take a substantial amount of time. The experiments are performed at the Parallel NVidia General Purpose Programming Graphic Processor Units Laboratory to accelerate the computational run time. In the following subsections, first, we present the details of the parameter settings. Second, the experimental design issues are investigated. Finally, the performance results of the Kumpula’s model and the proposed GA improvement model are compared using phone data from Finland.
Parameter settings
In our experiments, each algorithm is executed with a population of 10, 20 and 50 individuals and stopping criteria of 100 generations. The results presented in this section are the averages of 30 runs, where each run is performed with different instances of the same problem. Two-tuple integer-valued representation and generational replacement with elitism of 1 are considered. Different values for the crossover probability and the mutation rates are considered. The proposed method uses tournament selection with a tournament size of 2. In the new population policy, elitism-based selection (5%) and final individuals (after mutation) are used to create the next generation. The fitness function value of each generation is shown in Fig. 4 for a sample run. This figure indicates that the improvement is significant up to the 13th generation, but after that, there is only a slight change. This is why we set the maximum number of generations at 20.

Fitness value of the proposed GA on the number of generations.
The other parameters that must be set are two parameters of Kumpula’s model (pr and Δp), which are used to produce initial solutions of the proposed GA. In a local search of Kumpula’s model, a link between a node and its indirectly connected node is established with probability Δp, and each node can connect a random node with probability pr. A regression approach is performed to set pr and Δp. Kumpula’s models are produced with random parameter values, and a regression model is developed based on the average degree, average clustering coefficient, and largest component features. Equation (4) shows the regression model for the average degree.
The R2 and
Similarly, the R2 and
The R2 and
The five algorithm-specific parameters, which are population size, initial population, crossover type, mutation type, and operation percentage, are presented in Table 1. Initial population generation is performed in two different ways. Either the individuals are generated using Kumpula’s model or the One-Bit XOver generates half of them, and the remaining half is generated using Kumpula’s model. Four crossover operators and three mutation operator categories are considered in the first phase of the experimentation. We also consider three different operator ratios, which are (0.6, 0.4), (0.8, 0.2) and (0.8, 0.2). Therefore, a total of 216 tests are conducted for the first phase of the experimental study. All of the test cases are repeated 30 times.
Algorithm-specific parameters
Algorithm-specific parameters
Each combination is analyzed with the response surface analysis technique, and a summary of the results is presented in Tables 2 and 3. The results show that the model’s source of variability has been subdivided into several components. The components B, C and F represent linear and quadratic effects. The P-values indicate that F2 is significant, whereas the other interaction terms are not significant. Thus, we can remove nonsignificant terms from the model. Both the linear and second-order models look like a sufficiently good fit to data, since the value of R2 is 0.5813. It should be noted that the mean squared error in the second-order model is 4.744, considerably larger than that for the linear model.
Comparison of attributes
Anova for selected factorial models
Response surface analysis shows that the proposed GA produces the best performance when the population size is between 40 and 50, the operator percentage is between 25% and 60%, and the initial population is 10% Kumpula.
Based on the first phase of the experimental study, we determined the best values for all parameters (see Tables 3 and 4). The best results are observed when the initial population generation is performed 100% based on Kumpula’s model using 50 individuals in each population. If the improvement percentage of the best fitness values is selected as the comparison metric, the NBX Two-Bit crossover and Add mutation operator combination gives the best result. Otherwise, if the improvement percentage of average fitness values is selected, the DBX Two-Bit crossover and Delete mutation operator combination gives the best result.
Comparison of mutation and crossover operators

Comparison of GA improved weighted network, target network (Finnish phone data), and the best network produced with Kumpula’s model on (a) k-nearest neighbor, (b) degree distribution, (c) clustering coefficient, and (d) average link overlap.
The proposed GA takes a set of weighted social network models as the initial solution and tries to maximize the similarity between the statistical features of the network model and the real world weighted social network data.
In the literature, to the best of our knowledge, the Finnish mobile phone log is the only study on large-scale weighted social network data based on one-on-one social interaction [21]. Additionally, Kumpula’s study is one of the best-known weighted social network models. To evaluate the performance of our GA-based improvement approach, we take Kumpula’s model as the initial solution and produce a more realistic social network model so that its statistical features converge with the Finnish phone data.

Comparison of GA improved unweighted network, target network (Finnish phone data), and the best network produced with Kumpula’s model on (a) k-nearest neighbor, (b) degree distribution, and (c) clustering coefficient.
Figure 5 shows a comparison of the GA improved weighted network, target network (Finnish phone data), and the best network in the initial population based on a sample run. While the clustering coefficient and average link overlap are the most improved features, the degree distribution and k-nearest neighbor show slight improvement.
Comparison of the fitness function error components and the total fitness function
We also investigate the improvements performed by the proposed GA for unweighted social networks. To compare unweighted social networks, all weights are assumed to be 1. Figure 6 presents a comparison of the unweighted version of our GA approach, target network (Finnish phone data) and the best network produced with Kumpula’s model. Because of the lack of weight information, link overlap is not involved in this analysis. The comparisons show that while our GA approach can improve all three parameters, the improvement in the cluster coefficient parameter is the most remarkable.
Table 5 shows the fitness function error components and the total fitness function values for Kumpula’s model, weighted GA improvement and unweighted GA improvement. It presents the average values obtained from 30 independent runs. The weighted GA model can improve all the error components efficiently except for the average degree error. It can be stated that GA can improve weighted networks across all network parameters, while it tries to keep the average degree error almost constant. Consequently, the total fitness value can be decreased from 4.3930 to 1.7630.
On the other hand, our GA model can only decrease the k-nearest neighbor and average degree errors for unweighted networks. While it keeps the clustering coefficient error almost constant, it increases the degree distribution error. However, it should be noted that the degree distribution error is already very low compared to the other error factors. This is why the total fitness value can be decreased to 0.7665.
Social network modeling is an important issue in social network studies, and there are several different social network models in the literature. Although some of these models represent social interactions with unweighted links such as online communities, there are also various weighted models in which the link weights represent the strength of the relationships in a friendship or trust network. Whereas network models should be developed to have characteristics similar to those of real-life networks, there can be significant differences between the models and real-life data. In this paper, we proposed a novel GA-based approach to produce weighted and unweighted social network models that maximize the real-life social network data similarity in terms of some fundamental network graph features. Our GA-based approach generates a social network based on an existing model (Kumpula’s technique) in the literature and then improves it in terms of the similarity of the average degree, the k-nearest neighbor, the clustering coefficient, the degree distribution and link overlaps. The resulting improved network model is more realistic and can represent a specific real-life social network data set. Our main contributions are (1) developing a new GA-based model improvement algorithm handling the primary statistical properties of social networks; (2) identifying problem-specific heuristics for initial population generation and intelligent variation operators that comprise domain-specific knowledge; (3) conducting a comprehensive experimental study; and (4) confirming usability for both weighted and unweighted social networks. To demonstrate the effectiveness of our GA-based approach, an experimental study with various design parameters was performed using one of the best-known weighted social network data sets. The performance results clearly show that our GA-based approach can dramatically improve the existing social network models. Furthermore, our GA approach can be used to improve the initial models for not only unweighted models but also weighted networks.
