Editorial

Dear Colleague: Welcome to volume 21(3) of the Intelligent Data Analysis (IDA) journal.

This issue of the IDA journal contains 14 articles: 9 regular articles and 5 papers from MCPR (Mexican Conference on Pattern Recognition) 2015. These articles cover a wide range of topics related to theoretical and applied research in the field of Intelligent Data Analysis.

The first three articles of this issue are about various aspects of data preprocessing and classification. In the first article, Schclar et al. present an approach for the construction of ensemble classifiers based on dimensionality reduction, in which ensemble members are trained on dimension-reduced versions of the training set. The authors also present a multi-strategy ensemble which combines AdaBoost and Diffusion Maps. In a comparison with the Bagging, AdaBoost, and Rotation Forest ensemble classifiers on a large number of benchmark datasets, the authors’ results show that their proposed algorithms were superior in many cases to the other algorithms. Vinh and Anh, in the second article of this issue, argue that in time series classification the one-nearest-neighbor classifier with the Dynamic Time Warping measure in most cases outperforms more advanced classification algorithms. They further emphasize that instance reduction is one of the approaches to improving the time and space efficiency of the nearest neighbor classifier for time series data. The authors propose a two-step approach for instance reduction in time series classification. The main idea behind this method is that if one can compress two time series by the Minimum Description Length principle, one can combine them into one time series. The authors empirically compare their proposed method with methods such as INSIGHT and Naïve Rank Reduction. Their experimental results show that this method can outperform INSIGHT and Naïve Rank Reduction on many datasets. Badjiadji et al., in the last article of this group, argue that One-Class Classifiers (OCC) have been widely used to learn without counterexamples and that their extension to multi-class implementation offers an open scheme which allows adding new classes. However, using OCCs for multi-class implementation usually achieves lower accuracy than the usual multi-class implementations. To overcome this problem, the authors propose combining different types of OCC for multi-class classification by means of a new Dynamic Weighted Average (DWA) combination rule. Their experimental results on several real-world datasets demonstrate the effectiveness of their proposed approach, where the DWA rule achieves the best results against fixed rules as well as the decision template.

The next two articles are on clustering in IDA. Vu and Labroche, in the first article of this group, explain that active learning for semi-supervised clustering allows algorithms to solicit a domain expert to provide side information as instance constraints, such as a set of labeled instances called seeds. The authors further argue that active methods suffer from several limitations, such as: (i) being tailored for only one specific clustering paradigm or cluster shape and size, (ii) being counter-productive if the seeds are not selected in an appropriate manner, and (iii) needing to work efficiently with minimal expert supervision. The authors propose a new active seed selection algorithm that relies on a k-nearest neighbors structure to locate dense potential clusters and to efficiently query and propagate expert information. Their comparative experiments conducted on real data sets show the efficiency of this new approach compared to existing ones. Zahid et al., in the next article of this group, argue that mining the web usage data of e-business organizations is essential to provide knowledge about clients’ web utilization patterns and that, because of the non-deterministic web access behavior of web clients, web user session data is usually noisy and imperfect. The authors propose a robust Fuzzy c-Least Medians (FCLMdn) clustering framework to deal with user session data contaminated with noise and outlier user session objects. Their results indicate that the quality of the user session clusters formed using the FCLMdn algorithm is much better than that of clusters formed using other algorithms in terms of various cluster validity indices.

The third group of articles in this issue is about enabling techniques in IDA. Moshki et al., in the first article of this group, propose a new data-driven method for short-range forecasting of spatio-temporal systems. The approach is based on constructing forecasting models from several local models, where each local model is constructed in three steps. The proposed method, which is scalable, provides a reasonable trade-off between speed and precision. In addition to being scalable, this method also produces forecasts that are more accurate than those of competing systems. Huang et al., in the next article of this group, argue that finding frequent itemsets is a critical step in discovering association rules, where the number of frequent itemsets may be huge if the threshold of minimum support is set at a low value or the number of items in the transaction database to be mined is large. The authors explain some of the shortcomings of existing methods and propose an efficient algorithm to recover each frequent itemset and its approximate frequency based on the kept maximal itemsets, frequent 1-itemsets, their supports, and some key information. Their experiments show the compression effects of the proposed algorithm. Huang et al., in the eighth article of this issue, explain the topic of financial distress prediction (FDP), which has received considerable attention from both practitioners and researchers, and propose a novel support vector machine (SVM) classifier ensemble framework for FDP that is based on earnings manipulation and the fuzzy integral. The authors use three years of historical financial data to predict companies’ current financial situation and divide the companies in each year into different categories according to whether they manipulate their earnings. To verify the performance of their approach, the authors perform an empirical study using real financial data; their results indicate that introducing earnings manipulation, the new fuzzy measure determination, and the dynamic adjustment method to FDP can significantly enhance prediction performance. The last article of this group, by Oliveira and Pereira, is about price comparison services, which are widely used by e-shopping customers. Such e-shopping sites receive product offers from thousands of online stores, and in order to provide price comparison, product categorization, and searching, it is necessary to match different offers referring to the same real-world product. The authors propose a method that uses association rules to classify product offers from e-shopping web sites, matching offers against offers without the need for a product catalogue. This is a supervised learning method that trains a classifier whose generated model comprises a set of association rules to identify product offer classes. Their experimental evaluations show that their method is effective and efficient, and obtains better results than three baseline techniques on several datasets with distinct characteristics.

And finally, the last five articles are extended versions of selected papers from the 7th MCPR, which was held in Mexico City in June 2015. The first two deal with the problem of feature selection for IDA. Efrén Juárez-Castillo et al. propose a new feature selection method for the analysis of cognitive states. The proposed method makes it possible to deal with spatial variability in activations without defining areas of interest a priori. The idea is to define the activity of a brain region as a weighted vote of the activity observed in its neighbors in an isotropic three-dimensional space. To test the proposed method, the authors designed an experiment using fMRI, which makes it possible to obtain three-dimensional images of discrete brain regions activated by observing images of faces and buildings. In the experiments, the classification stage was carried out using Support Vector Machines, reaching a classification accuracy of 96%, which exceeds the accuracy of 84% obtained by a traditional feature selection method. Ángel Hernández-Castañeda and Hiram Calvo propose using a continuous semantic space model, represented by Latent Dirichlet Allocation topics, combined with other kinds of features, such as a word space model, syntactic information from n-grams, or dictionary-based features, in order to identify deceptive text. This work shows that selecting the appropriate set of features makes it possible to obtain state-of-the-art performance using a Naïve Bayes classifier. The experiments were conducted on three different corpora: one consisting of reviews about hotels; another with reviews about controversial topics; and the last one of reviews about books. The results show that merging Latent Dirichlet Allocation topics and a word space model attains the best results, with the advantage that language-specific resources are not required.

Two articles are about applications of IDA techniques to real-life problems. The work of Hiram Calvo et al., based on the fact that public security and crime fighting are important issues in big cities around the world, proposes a new method for forecasting criminal activity in a particular region by using a supervised classification model based on CR+. Then the spatial and temporal decisions made by criminals are clustered using a density-based clustering algorithm, in order to identify hotspots where criminal activity is concentrated. Using this information, patrolling routes are designed using ant-colony optimization. The authors show that the proposed method was always able to find optimal routes in a shorter time than commonly used random walk algorithms. A case study based on real crime data from Cuautitlán Izcalli, Mexico, is included. The second article, by Brena and Garcia-Ceja, proposes a crowdsourcing method for building personalized models for human activity recognition, which combines the advantages of both user-dependent and general models by finding class similarities between the target user and the community of users. The authors show how to build a personalized model by identifying clusters of users that behave in a similar way with respect to an activity. The experimental results show that the proposed method improves accuracy in recognizing human physical activities, offering the benefits of personalized training without the effort of painful individual training.

In the last article, an IDA technique for image preprocessing is addressed by Mújica-Vargas et al. The authors introduce an impulsive noise filter using a median re-descending M-estimator in order both to increase the capacity for noise suppression and to better preserve details in an image. The proposed re-descending M-estimator controls the magnitude of the salt or pepper impulses and deletes them when necessary. The results obtained after extensive simulations show that the proposed filter outperforms other recent approaches.

In conclusion, we would like to thank all the authors who have contributed to this issue of the IDA journal. Our special thanks go to the authors of the last five articles, who submitted extended versions of their papers for inclusion in this issue. We look forward to receiving your feedback along with more and more quality articles in both applied and theoretical research related to the field of IDA.

With our best wishes

Dr. A. Famili

Editor-in-Chief

J.F. Martinez-Trinidad, J.A. Carrasco-Ochoa

J.A. Olvera-Lopez and J.H. Sossa-Azuela

Guest Editors