Abstract
Data collected in psychological studies are mainly characterized by containing a large number of variables (multidimensional data sets). Analyzing multidimensional data can be a difficult task, especially if only classical approaches are used (hypothesis tests, analyses of variance, linear models, etc.). Regarding multidimensional models, visual techniques play an important role because they can show the relationships among variables in a data set. Parallel coordinates and Chernoff faces are good examples of this. This article presents self-organizing maps (SOM), a multivariate visual data mining technique used to provide global visualizations of all the data. This technique is presented as a tutorial with the aim of showing its capabilities, how it works, and how to interpret its results. Specifically, SOM analysis has been applied to analyze the data collected in a study on the efficacy of a cognitive and behavioral treatment (CBT) for childhood obesity. The objective of the CBT was to modify the eating habits and level of physical activity in a sample of children with overweight and obesity. Children were randomized into two treatment conditions: CBT traditional procedure (face-to-face sessions) and CBT supported by a web platform. In order to analyze their progress in the acquisition of healthier habits, self-register techniques were used to record dietary behavior and physical activity. In the traditional CBT condition, children completed the self-register using a paper-and-pencil procedure, while in the web platform condition, participants completed the self-register using an electronic personal digital assistant. Results showed the potential of SOM for analyzing the large amount of data necessary to study the acquisition of new habits in a childhood obesity treatment. 
Currently, the high prevalence of childhood obesity points to the need to develop strategies to manage a large number of data in order to design procedures adapted to personal characteristics and increase treatment efficacy.
Introduction
In psychology, as in other sciences, large amounts of data are analyzed in order to draw conclusions about them. Many methods can be used to analyze these data, such as factor analysis and logistic regression, which are classic methods, or more complex methods, such as neural networks or support vector machines, which are nonlinear. In addition, there are the so-called visual techniques, which can provide visual intuition about data. They can be divided into two groups: univariate, which produce representations of one variable (e.g., histograms, box-and-whiskers plots, etc.), and multivariate, which try to show the relationships among several variables (e.g., parallel coordinates, self-organizing maps [SOM], etc.).
Parallel coordinates is a classic technique used for visualizing high-dimensional space (Zhou et al. 2008), and it is quite useful when the data set is small but not when it deals with thousands of observations and dozens of variables. For these cases, the SOM algorithm is preferred due to its high power of synthesis. The SOM representation technique has been used in different fields, such as medicine (Cattinelli et al. 2012; Chang and Teng 2007; Rosado-Muñoz et al. 2013), computer security (DeLooze 2004; Pachghare, Kulkarni, and Nikam 2009), engineering (Frey 2012; Kohonen et al. 1996; Panapakidis et al. 2013), and so on.
Although SOM has been widely used in many applications, it is still used very little in psychology. This article briefly discusses the structure of this neural network model, the method for obtaining the parameters, and the way to interpret the graphs obtained. To explain SOM, a “toy” data set will be used that was generated by the authors to make initial contact with this visual tool easier. To reflect the potential use of this tool in psychology, a case in this field is also presented, using data from a study on a childhood obesity treatment. Changing behavior is the target of the most efficient obesity interventions, and for this reason, data related to childhood obesity have been chosen for this article as an example of applying the SOM to analyze “self-monitoring” data.
Childhood obesity is one of the greatest public health problems in the twenty-first century (Branca, Nikogosian, and Lobstein 2007). Dietary and physical activity self-records are referred to as the “cornerstone” of behavioral weight control programs because they are considered useful techniques to collect information on patients’ behaviors and the acquisition of new habits and to evaluate the treatment effects (Baños et al. 2009, 2011; Oliver et al. 2013). Specifically for childhood obesity treatment, dietary self-records usually include the registering of the type and amount of food and beverages, social situations and the place where the intake occurs, and the emotions related to the intake moment. For the physical activity self-register, the most common variables registered are the type of exercise done, the time of the implementation, and the subjective fatigue associated with the activity. People in treatment record these variables several times a day in order to analyze their progress in modifying their habits. Adherence to self-register procedures is an important indicator of successful weight management. A consistent relationship has been found between self-register adherence and success in both losing weight and maintaining weight loss (Baker and Kirschenbaum 1998; Collins, Kashdan, and Gollnish 2003). However, the large amount of data recorded throughout the treatment makes it necessary to use analysis procedures that can take many variables into account. SOM analysis techniques allow a global view of all the data.
Three hypotheses were formulated in the present study: (a) SOM techniques will be useful for exploring the data collected through self-record systems in a childhood obesity treatment; (b) SOM techniques will be able to analyze the adherence to self-records in a childhood obesity treatment; and (c) SOM analysis will be able to analyze the dietary intake and level of physical activity in children who completed a treatment focused on modifying their habits to control their weight.
This article is organized as follows: In the Method section, the basic elements of the SOM algorithm are explained, and an example of the use of this technique is presented using a synthetic data set. Next, the SOM algorithm is applied to a real problem (childhood obesity) in order to extract useful information from the data. This section will describe the participants recruited and the measurements used to analyze the treatment effects. The Results section will focus on the presentation and explanation of the data obtained through the application of SOM analysis in the specific example of childhood obesity treatment. Finally, general conclusions will be summarized in the Discussion section.
Method
SOM
SOM is a specific artificial neural network (ANN) whose purpose is to find groups of patterns (clustering) and compress the information from high-dimensional data into geometric relationships on a low-dimensional representation that allows the visualization and correlation of complex patterns. An ANN is a computational model based on human brain functioning. The main characteristic of an ANN is its ability to acquire information from the environment and improve its performance, taking into account a prescribed model that constitutes the learning paradigm (Haykin 2009).
In a SOM algorithm, the neurons (or nodes) are ordered in two layers: the input layer (composed of N neurons, one for each input variable) and the competition layer (also called the second layer, composed of a topological low-dimensional grid of neurons—usually two-dimensional—geometrically ordered). Each input layer neuron is connected with every unit on the competition layer, and, subsequently, an N-dimensional weight vector is assigned to each competition layer unit. Thus, a set of observations is associated with each second layer unit. The algorithm finds the sets that best describe the observation domains. The sets are arranged on the two-dimensional grid, so that similar sets are closer to each other than to different ones (Kohonen et al. 2001).
Two choices have to be made: the map type (hexagonal or rectangular grid, which indicates the neighborhood relation or topology) and the number of neurons (which defines the size of the low-dimensional grid). These choices depend on the size and dispersion of the input data (Vesanto et al. 1999). Then, a learning algorithm is used to calculate the associated weights for each competition layer neuron. For a given input observation, the neuron whose weight vector is closest to that observation is called the winning neuron or best-matching unit (BMU).
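The BMU search and a single learning step can be illustrated with a minimal sketch. It is not the toolbox implementation used in this study: it assumes a rectangular grid, Euclidean distance, and a Gaussian neighborhood, and the names `make_grid`, `find_bmu`, and `online_update` are hypothetical helpers introduced only for illustration.

```python
import math
import random


def make_grid(rows, cols, n_features, rng):
    """Map each grid position (row, col) to a random N-dimensional weight vector."""
    return {
        (r, c): [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for r in range(rows)
        for c in range(cols)
    }


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_bmu(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], x))


def online_update(weights, x, learning_rate, radius):
    """Move the BMU and its grid neighbors toward x (assumed Gaussian neighborhood)."""
    bmu = find_bmu(weights, x)
    for pos, w in weights.items():
        grid_dist2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
        for i in range(len(w)):
            w[i] += learning_rate * h * (x[i] - w[i])
    return bmu


rng = random.Random(0)
weights = make_grid(rows=4, cols=4, n_features=3, rng=rng)
x = [0.5, -1.0, 2.0]  # one hypothetical observation with three variables

before = squared_distance(weights[find_bmu(weights, x)], x)
bmu = online_update(weights, x, learning_rate=0.5, radius=1.0)
after = squared_distance(weights[bmu], x)
```

With a learning rate of 0.5, the winning unit moves halfway toward the observation, so its squared distance to the observation falls to one quarter of the previous value. Units farther away on the grid move less, which is what produces the topological ordering of the map.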
Once the map training is finished, the two-dimensional map can be visualized. It is also called the “components plane,” and it provides qualitative information about how the input variables are related to each other for the given data set. Furthermore, a “hits map” can be prepared. It represents the number of times each map unit, or neuron, was the BMU for each input register, so that the distribution of the BMU for a given set of data is represented. This information gives an idea of the number of input observations gathered in each neuron, which makes it possible to compare the strength of each unit on the components plane.
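As a rough illustration of these two views, the sketch below computes a hits map and one component plane for a tiny, hand-made map. The weights and observations are invented for the example, and `hits_map` and `component_plane` are hypothetical helper names, not functions of the toolbox used in this study.

```python
from collections import Counter


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_bmu(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], x))


def hits_map(weights, data):
    """Number of observations for which each unit is the BMU."""
    counts = Counter(find_bmu(weights, x) for x in data)
    return {pos: counts.get(pos, 0) for pos in weights}


def component_plane(weights, feature_index):
    """Value of one input variable in every unit's weight vector."""
    return {pos: w[feature_index] for pos, w in weights.items()}


# Invented 1 x 3 map with two input variables per unit.
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0], (0, 2): [2.0, 0.0]}
data = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [2.1, 0.1]]

hits = hits_map(weights, data)       # {(0, 0): 1, (0, 1): 2, (0, 2): 1}
plane = component_plane(weights, 1)  # {(0, 0): 0.0, (0, 1): 1.0, (0, 2): 0.0}
```

Plotting one component plane per variable on the same grid lets the reader compare variables unit by unit, while the hits map indicates how many observations support each unit.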
The SOM algorithm provides qualitative information to establish relationships between variables, and this is not possible with other methods. Given the difficulty of analyzing the data set of the present study, SOM was proposed to extract knowledge. Specifically, SOM analysis allows a large amount of data to be summarized in an easily interpretable graphical representation. Moreover, SOM analysis provides an unsupervised modeling approach, meaning that no a priori hypotheses need to be formulated by users at the beginning of the study. These characteristics make it possible to obtain unbiased results, and unanticipated relationships between different variables can freely emerge.
Obtaining useful information in a relatively complex multidimensional database is extremely straightforward when using this model and applying this interpretation methodology. For this reason, SOM analysis is the preferred procedure for our study due to its ability to provide a compact and unbiased representation of multiple data. Indeed, SOMs provided greater insight about the acquisition of new habits, and they highlighted unpredicted existing relationships.
Analyzed Problem: The Childhood Obesity Treatment
The problem to be addressed is the analysis of data from adolescent and preadolescent children suffering from overweight and obesity, who were receiving a weight loss treatment focused on developing healthy eating habits and increasing their daily physical activity. There were two treatment conditions with the same components: (a) treatment supported by a web platform and (b) traditional treatment (only face-to-face; for a more detailed description, see Baños et al. 2009, 2011, 2013; Oliver et al. 2013). The only difference between the two conditions is the use of the web platform. The final goal in both conditions was to increase the consumption of healthy foods, reduce the consumption of unhealthy foods, and increase the level of physical activity in order to gain control over their excess weight and promote healthier habits. Moreover, we expected that children in the condition supported by the web platform would show greater results in modifying their habits than children receiving the traditional face-to-face treatment.
Participants
The total sample was composed of 47 children (32 boys) ranging from 8 to 13 years old (M = 10.48, SD = 1.56): 25 participants in the traditional treatment and 22 participants in the treatment supported by the web platform. The sample was recruited from a Child and Adolescent Cardiovascular Risk Unit. No significant differences in sex or age were observed between the groups. The mean body mass index was 28.8 (SD = 3.59). Regarding weight, z-scores adjusted for sex and age were calculated, with a mean z-score of 2.75 (SD = 0.29).
Variables and Measurements
Dietary and physical activity self-registers were used. Dietary self-registers included the type and amount of food and beverage, the social situation and place where the intake occurs, and the emotions related to the intake moment. Physical activity self-registers included the type of exercise, the time of implementation, and the subjective fatigue associated with the activity. Children in traditional treatment compiled the self-register weekly using a traditional “paper-and-pencil” procedure, while children in the treatment supported by the web platform compiled the self-register using an electronic personal digital assistant (PDA; for a more detailed description, see Oliver et al. 2013). Self-register variables were the same in both procedures.
The study included nine variables to be analyzed: treatment condition, adherence to the self-register, fruit intake, vegetable intake, carbohydrate intake, protein intake, dairy intake, fast food intake, and physical activity.
The first variable, “condition,” refers to the treatment condition to which each participant was randomly assigned. The variable “adherence” indicates the amount of eating and physical activity information that children entered in the self-register over the complete treatment. Treatment was 10 weeks long. During this period, children were asked to enter information about eating (breakfast, lunch, dinner, and snacks consumed) and the physical activities that they carried out in their daily lives. Therefore, this variable is a count that ranged from zero (nothing registered) upward with no fixed upper bound, depending on the number of foods consumed and registered and the activities practiced.
In order to analyze the progress in the food treatment, information about different nutritional groups was registered: fruit consumption (“FS” variable); vegetable consumption (“VS” variable); carbohydrate consumption (“HCS” variable; e.g., bread, cereal, and pasta); consumption of meat, fish, legumes, and eggs (“Prot” variable); dairy consumption (“LS” variable; e.g., milk, cheese, and yogurt); and consumption of fat foods (“Gr” variable; e.g., chocolates, cakes, sweets, fast food, chips, sweet beverages, etc.). Physical activity (“AF” variable) was also analyzed as the number of activities the children carried out in their daily lives (e.g., walking, running, playing football, dancing, skating, basketball, jumping, etc.).
Results
SOM was implemented using a free Matlab Toolbox that is available at http://www.cis.hut.fi/projects/somtoolbox/download/
To train the SOM, several parameters were varied. Initialization: two types of initialization were used, (a) random, in which the initial weights follow a normal distribution (with zero mean and variance equal to one), and (b) based on principal components analysis (Kohonen 1982). Training algorithm: two types of algorithm were used, (a) batch, in which the adjustment values of the different weights are accumulated across all the training items, and (b) online training, in which weights and bias values are adjusted for each training item.
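To make the batch option more concrete, the following sketch combines random initialization from a standard normal distribution with a batch update in which each unit is replaced by a neighborhood-weighted mean of all training items. It assumes a rectangular grid and a Gaussian neighborhood, the data are synthetic, and the principal-components initialization is not shown. The names `random_init` and `batch_epoch` are hypothetical and do not correspond to toolbox functions.

```python
import math
import random


def random_init(rows, cols, n_features, rng):
    """Random initialization: weights drawn from a normal distribution (mean 0, variance 1)."""
    return {
        (r, c): [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for r in range(rows)
        for c in range(cols)
    }


def find_bmu(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((a - b) ** 2 for a, b in zip(weights[pos], x)),
    )


def batch_epoch(weights, data, radius):
    """One batch pass: contributions from all items are accumulated, then applied once."""
    bmus = [find_bmu(weights, x) for x in data]
    new_weights = {}
    for pos, w in weights.items():
        numerator = [0.0] * len(w)
        denominator = 0.0
        for x, bmu in zip(data, bmus):
            grid_dist2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            denominator += h
            for i, xi in enumerate(x):
                numerator[i] += h * xi
        new_weights[pos] = [n / denominator for n in numerator]
    return new_weights


rng = random.Random(1)
# Synthetic data: two groups of observations with two variables each.
data = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(20)]
data += [[rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)] for _ in range(20)]

weights = random_init(rows=3, cols=3, n_features=2, rng=rng)
for radius in (2.0, 1.0, 0.5):  # assumed shrinking neighborhood radius
    weights = batch_epoch(weights, data, radius)
```

Because every updated weight vector is a weighted mean of the observations, the units always lie inside the range of the data after a batch pass. In the online variant, by contrast, the weights change after every single training item.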
To obtain the best SOM, two error measures were used: (a) topographic error, which measures the ability of the SOM to preserve neighborhood relations between the original patterns and their SOM projections, and (b) quantization error, which gives an idea of the SOM’s accuracy in modeling the data. Among the different versions of the trained SOM, the best version was the one with the minimum product of quantization error and topographic error.
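The two error measures and the selection rule can be sketched as follows. The definitions are the common ones and are assumptions about the exact computation used here: quantization error is taken as the mean distance between each observation and its BMU, and topographic error as the proportion of observations whose first and second BMUs are not adjacent on an assumed rectangular grid. The toy maps and the helper names are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def two_best_units(weights, x):
    """Grid positions of the closest and second-closest units to x."""
    ranked = sorted(weights, key=lambda pos: distance(weights[pos], x))
    return ranked[0], ranked[1]


def quantization_error(weights, data):
    """Mean distance between each observation and its BMU."""
    total = sum(distance(weights[two_best_units(weights, x)[0]], x) for x in data)
    return total / len(data)


def topographic_error(weights, data):
    """Share of observations whose two best units are not grid neighbors."""
    errors = 0
    for x in data:
        first, second = two_best_units(weights, x)
        if abs(first[0] - second[0]) + abs(first[1] - second[1]) != 1:
            errors += 1
    return errors / len(data)


def select_best(candidates, data):
    """Pick the map with the smallest product of the two errors."""
    return min(
        candidates,
        key=lambda w: quantization_error(w, data) * topographic_error(w, data),
    )


# Two invented 1 x 3 maps with the same weight values in a different grid order.
ordered = {(0, 0): [0.0], (0, 1): [1.0], (0, 2): [2.0]}
folded = {(0, 0): [0.0], (0, 1): [2.0], (0, 2): [1.0]}
data = [[0.2], [0.9], [1.8]]

qe_ordered = quantization_error(ordered, data)
te_ordered = topographic_error(ordered, data)
te_folded = topographic_error(folded, data)
best = select_best([ordered, folded], data)
```

Both toy maps represent the data equally well, so their quantization errors coincide, but only the ordered map keeps similar units next to each other. One consequence of a product criterion is that a map with zero topographic error is selected regardless of its quantization error, which may be worth checking when comparing candidates.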
SOM analysis was carried out, and the results are summarized in this section in order to find out the distribution of participants with regard to the dietary and physical activity self-registers. Figure 1 shows the distribution of the participants over the whole map. Figure 2 shows the distribution of the participants according to the variables analyzed: two conditions (traditional vs. treatment supported by a web platform), the adherence to the self-records, intake consumption (fruits, vegetables, carbohydrates, protein, dairy, and fats), and physical activities implemented. The conclusions obtained by observing these figures are summarized below.
The variable “adherence” tends to be higher in participants who perform fewer “activities” in their daily lives.
The variables “FS” and “LS” seem to be highly correlated; that is, participants who usually eat more fruit also consume more milk, cheese, and yogurt.
The variables “HCS” and “VS” are highly correlated; that is, participants who usually eat more carbohydrates are also likely to eat more vegetables.
Regarding “fat foods,” participants in the traditional condition show high variability, as some of them usually eat many fat foods, whereas some of them do not. However, participants in the treatment supported by a web platform condition do not usually eat fat foods.
The variable “Prot” does not seem to be correlated with any other variable.

Figure 1. Winners map; this map shows the distribution of the participants along the whole map. The colored area inside each hexagon is proportional to the number of input patterns that are most similar to this neuron.

Figure 2. Components map obtained with the self-organizing maps algorithm for the synthetic problem data. This map shows projections corresponding to the different input features used to train the self-organizing map.
Regarding “adherence,” the values obtained confirm participants’ low adherence to self-register tasks, and adherence is even lower for participants in the web platform condition. One possible explanation has to do with the technical problems with the PDA system. Qualitative information provided by children and parents indicated technical difficulties, such as rapid battery discharge, problems with the login and password, or the difficulty of introducing dietary and physical activity information using an electronic pencil. These problems could impede the implementation of electronic self-registers and show the importance of making improvements in the devices and software in order to increase participants’ adherence.
Discussion
Studies about health behaviors and lifestyles usually manage a large number of different variables (physical, emotional, cognitive, sociocultural, etc.), and they have to extract multiple data for their analysis. Thus, it is necessary to have efficient strategies to analyze and interpret these data from a multivariate perspective (Astel et al. 2010). The use of visual data mining emerges as a technique that is able to provide high-quality information about the results obtained and the possibility of designing and developing strategies adapted to personal and clinical characteristics (Wickramasinghe et al. 2011). Visual data mining, and particularly the SOM technique, is a new procedure that takes into account all the variables registered during the treatment and considers these data globally in the analysis procedures. SOM analysis has been used in several medical studies. For example, Wickramasinghe et al. (2011) and Astel et al. (2010) showed that SOM techniques were useful for identifying patterns in patients with diabetes, and these results increased the possibility of developing specific strategies to manage these patients efficiently. SOM analyses have also been used to evaluate the efficacy of a specific screening for assessing the presence of infection in people (e.g., Sun et al. 2011), showing better results than linear discriminant analysis. Other studies have shown the utility of SOM in analyzing the satisfaction of patients in nursing science (Voutilainen et al. 2014), identifying groups of children with different risk profiles for growth development (Schilithz et al. 2014), or obtaining a deeper understanding of ventricular fibrillation (Rosado-Muñoz et al. 2013). However, to the best of our knowledge, no previous study has focused on the use of SOM analysis in childhood obesity treatment.
The final objective of the present study was to obtain a visual representation, in a simple and comprehensive manner, of the relationship between intake and physical activity variables during a treatment focused on modifying habits. Taking into account the hypotheses formulated, SOM techniques were viable and efficient procedures for exploring the data from self-records and analyzing the adherence to these procedures. Moreover, SOM techniques facilitated the understanding of the relationship between the specific treatment implemented (traditional vs. web platform) and its effects on the acquisition of new healthier habits. In this sense, the results showed that children who completed the web platform condition treatment increased their consumption of healthy foods and reduced their intake of unhealthy food (e.g., fat foods). This information is relevant in designing specific therapeutic strategies to modify lifestyle behaviors.
One of the main limitations of the present study is the low adherence to the self-record procedures, specifically when PDA systems were used. As mentioned above, technical problems can explain this low adherence, such as rapid battery discharge and the difficulty of introducing information using an electronic pencil. The objective of the electronic self-record techniques is to offer a comfortable procedure where specific information can be introduced in an easy and noninvasive way. The final objective is to facilitate the participants’ arduous task of introducing the information needed to analyze the treatment progress and make therapeutic decisions. Therefore, our data suggest the importance of developing more ergonomic and easier to use electronic procedures for self-recording behaviors, adapted to the children’s lifestyle.
Despite the limitations, this study shows that the SOM is a technique with great potential in this field. Interventions on lifestyle always involve the daily monitoring of many variables because it is assumed that their analysis provides very useful and enriching information. However, until now, much of this information could not be analyzed in a simple way using traditional statistical methods. The SOM provides an easy way to extract information from a lot of data, and we think it will be very useful for analyzing self-monitoring data in psychological interventions.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded in part by the Spanish Ministry of Education, Culture and Sport, Projects ACTIOBE (PSI2011-25767), Excellence in Research Program PROMETEO II (Generalitat Valenciana. Conselleria de Educación, 2013/003), and CIBER Fisiopatología de la Obesidad y la Nutrición (ISC III CB06 03/0052).
