Abstract

Dear Colleague: Welcome to volume 22(4) of the Intelligent Data Analysis (IDA) journal.
This issue of the IDA journal, the fourth issue of our twenty-second year of publication, contains 12 articles representing a wide range of topics related to theoretical and applied research in the field of Intelligent Data Analysis.
The first two articles of this issue are about various aspects of data engineering. Gong et al. in the first article of this group argue that although active learning has proven effective in machine learning and lowers labeling cost, it still cannot reach state-of-the-art performance on several classification tasks because of its sensitivity to the initial state. The authors propose a novel algorithm to improve the performance of active learning and its robustness to the initial state. Their experiments on several benchmark binary classification datasets show that the proposed algorithm outperforms other active learning methods in accuracy and robustness. Lee et al. in the second article of this group discuss customer-voice data classification and argue that noisy data has a negative effect on the classification task. The authors propose an advanced novelty detection method based on a class vector that possesses high cosine similarity with words to effectively discriminate between classes. Their qualitative observations verify the suitability of the proposed method, and their quantitative experiments verify its representational effectiveness and classification performance on customer-voice data.
The next six articles are about various forms of unsupervised and supervised machine learning methods and their applications in IDA. Dundar and Korkmaz in the first article of this group discuss various aspects of data clustering (an unsupervised machine learning approach) and propose a method based on the formation of clusters in a cellular automaton through the interaction of neighbouring cells. In this approach, the data points are mapped to fixed cellular automata cells, and the clusters are formed in a parallel fashion. No distance calculation is used during the procedure, and their experiments show that the proposed method can cluster huge datasets within a reasonable amount of time. Ye and Sakurai, in the fourth article of this issue, argue that the similarity measure for complex data may not precisely reflect the true data structure, and propose a novel spectral clustering method that measures the similarity of data points based on the adaptive neighbourhood in kernel space. To validate the efficacy of the proposed method, the authors perform experiments on both synthetic and real datasets and compare it with some existing spectral clustering methods. The experimental results demonstrate that the proposed method obtains quite promising clustering performance. Meng and Pu in the next article of this group argue that most time series clustering work focuses on clustering algorithms and similarity measures, whereas a recently proposed u-shapelet-based time series clustering method not only achieves high clustering performance but also offers an acceptable interpretation of the clustering result. The authors propose a Random Local Search algorithm that reduces the time needed to discover u-shapelets while maintaining or even improving the quality of clustering. Their extensive experiments on 27 UCR time series datasets demonstrate improved clustering accuracy over existing approaches. Cabrera et al. in the next article observe that many real-world situations constantly generate unbounded concept-drifting data streams from non-stationary environments, and that these situations demand adaptive algorithms that can learn online and keep the learning model up to date with the most recent target function (concept) of the data. The authors present an Online Adaptive Classifier Ensemble approach which is able to learn from concept-drifting data streams. The proposed algorithm uses a change detection mechanism in each base classifier in order to handle possible changes in the underlying target function. An alternative classifier is added to the ensemble if the drift stage is reached. The authors compare the new algorithm with various state-of-the-art ensemble algorithms for online learning. Ansari et al., in the seventh article of this issue, propose a new optimization approach for Apriori-based association rule mining algorithms in which the frequency of items is encoded and treated in a special manner, drastically increasing the efficiency of the frequent itemset mining process. The proposed algorithm takes advantage of the encoded information to decrease the number of candidate itemsets generated in the mining process, and consequently drastically reduces execution time in the candidate generation and support counting phases. Their experimental results on real datasets demonstrate that the proposed algorithm is an order of magnitude faster than the classical Apriori approach without any loss in generating the complete set of frequent itemsets. Liu et al. in the eighth article of this issue argue that data uncertainty is inherent in many real-world applications such as sensor data monitoring and mobile tracking. They further emphasize that mining sequential patterns from uncertain or inaccurate data, such as sensor readings and GPS trajectories, is important for discovering hidden knowledge in such applications.
The authors present a dynamic programming approach, called CoDP, to compute the exact probability that a pattern
Finally, the last four articles of this issue are about novel methods and their applications in IDA. Zhang et al. in the first article of this group argue that existing approaches to weather forecasting have limited ability to predict future precipitation in different regions. To address this problem, the authors propose a big-data-based approach for precipitation forecasting based on deep belief nets. The proposed approach can not only learn the hierarchical representation of raw data in a highly generalized way, but also describe more accurately the rule underlying different kinds of environmental factors. The authors present a set of experiments with hydrological multivariate time series data to validate the feasibility and robustness of their proposed approach. Their results show that the proposed approach is more robust than other approaches and can also improve forecast precision. Brunello et al. in the tenth article of this issue present a case study of evaluating operators’ work quality in a medium-sized contact centre, and in particular the problem of selecting the correct variables to be used in such an evaluation. Starting from a data set that is representative of a particular company’s range and size of activities but allowed no usable predictive model for evaluating the skills of the agents, the authors were able to devise a reproducible methodology, along with an a-posteriori optimization process, to select the essential variables that should be used to objectively evaluate the quality of the agents’ work. The authors argue that the proposed methodology may be extrapolated and reused in other comparable contexts characterized by the measurability of human operators’ performance. Mookiah et al. in the next article argue that news articles contain different object types such as people, organizations, statistical (numerical) information, countries, authors, or events, which creates a complex heterogeneous graph containing multi-type objects (vertices) and multi-type linkages (edges) among the objects, such as common keywords found between two news articles. This is called a Heterogeneous News Graph (HNG), and the authors believe that an HNG could be used to resolve the bias and visibility issues found in many news sources, as well as to capture important news articles. Based on their experiments, the authors claim that HNGs will: (1) rank the expertness of an article’s author on a specific topic, and (2) identify articles of particular interest and value. The last article of this issue, by Lin et al., is about a stacking model for variation prediction focusing on public bicycle traffic flow. They argue that a system which could recommend the nearest available stations to passengers, whether they are looking for a docking station or a bike, is of great importance. Furthermore, monitoring the current number of docks or bikes at each station cannot solve this problem, because it is too late to recommend a station for passengers to rent or return bikes after the imbalance has occurred. To address this issue, the authors propose an approach that integrates multiple base models trained on different combinations of features in order to achieve better performance. They also consider traditional factors, such as temporal, spatial, historical and meteorological factors. The authors demonstrate the performance of their approach on real datasets and compare it with traditional stacking and a single-model method.
In conclusion, we would like to thank all the authors who have submitted the results of their excellent research to be evaluated by our referees and published in the IDA journal. As usual, in addition to six regular issues, we will also have one special issue for 2018, which will be published in the middle of this year. We look forward to receiving your feedback along with more and more quality articles in both applied and theoretical research related to the field of IDA.
With our best wishes,
Dr. A. Famili
Editor-in-Chief
