Abstract

Almost every scientist and engineer suffers under the weight of high dimensionality in the era of big data. The more data there are and the higher the number of dimensions, the harder it becomes both to inspect the data visually and to solve the underlying problem. A key technology for handling such problems is machine learning. So, when I received Visual Knowledge Discovery and Machine Learning to review, the book seemed to promise suggestions on how to deal with big data.
My first move when I start reading a scientific book is usually to go through its table of contents. My first impression of the book's structure was that it is not split equally between visual knowledge discovery and machine learning, as the title would suggest. Instead, the book focuses mainly on visual representation issues and has only a few chapters on topics related to machine learning or to visual knowledge discovery using machine learning. Some methods for visualising high-dimensional data and for identifying complex n-D data patterns are discussed. However, the book mainly specialises in the General Line Coordinates (GLCs) methodology, as most of the chapters present case studies that were solved with GLCs.
The first chapter provides a nice introduction to the problem of big data and high-dimensional visualisation, along with state-of-the-art methods. Although this chapter is quite clear, I felt that a running example applying all of the methods referenced in the introduction would give the reader a deeper understanding of the fundamental differences that are explained only verbally. After the introductory chapter, two chapters cover the definitions of the different types of GLCs (reversible and non-reversible) and present their mathematical foundations. Chapter 4 explains how to deal with occlusion and pattern simplification in GLCs when the dimensionality is high and visualisations are hard to interpret. Then follows a chapter filled with a wide variety of case studies (among them a traditional classification problem on the Iris data set from the UCI repository; see UCI Machine Learning Repository, Fisher, 1936), which puts the knowledge acquired on GLCs into context. I found it helpful that case studies run through the different chapters and illustrate the different topics; for example, feature extraction, selection and knowledge discovery. Moreover, the book has a chapter introducing a virtual data scientist, which automates the processes that a data scientist usually follows, with the aim of overcoming the limitations of occluded information. Finally, the book concludes with a brief comparison of a couple of methods that were not mentioned earlier and suggests future research.
For me, one of the most interesting chapters was Chapter 11, where the dimensionality reduction problem is formulated as a multi-objective optimisation problem. The proposed methodology is to construct the Pareto front using GLC-L and to select appropriate weights interactively. More case studies in this chapter would have been helpful. I would have preferred to see a chapter covering all related work and the methods compared with GLCs at the beginning, after the introductory chapters. In addition, I think that the virtual data scientist, together with the future research directions, would have made a very good concluding chapter for this book.
Last but not least, I would like to comment on the general presentation of the book. The length of the chapters is quite balanced, with each chapter building on the knowledge of the previous one. The language in the book is smooth, although some chapters (e.g., Chapter 9) are quite brief and remind me of conference-style writing, which is usually very compressed due to page limits. A major drawback in terms of presentation, and one that dampens the enthusiasm for reading, is that many of the figures in the book are of low quality: many are blurry and in a few cases almost illegible (e.g., Figure 6.1, p. 142). In some cases, the resolution of the source figure looks so poor that it appears to have been acquired as a screenshot (e.g., Figure 9.2, p. 259), which of course might be a weakness of a software tool that does not export figures in vector file formats (e.g., .pdf or .svg). In other cases, I get the impression that the figures were designed in a Microsoft Office environment (probably in PowerPoint) using the shadow effect, which is generally suitable only for projected presentations. I would recommend revising the figures of the book. I also spotted some typos, although this is a minor issue that I am sure will be addressed soon.
What can Perception readers expect from this book? The book addresses aspects of big data handling and analysis. In particular, it provides an alternative to principal component analysis for visualising high-dimensional data and for feature selection. The case studies presented, although derived from different fields, could be used to draw inspiration. Generally, the book is better suited to a data scientist, or to someone who would like to specialise in GLCs, than to the typical Perception reader, though it provides a helpful introduction along with a wide variety of case studies that help any scientist to become familiar with this method. Regarding value for money, I feel that the book is a bit overpriced for its content, overall presentation and number of pages.
