The Best of SIAM Data Mining 2020

Abstract

This special issue presents the best of the SIAM Data Mining (SDM) conference 2020. SDM 2020 received 388 submissions and accepted 75 articles, an acceptance rate of 19.3%. Of these 75 articles, the 10 with the highest ratings were selected by the PC chairs and invited to the special issue. This issue comprises six articles in two categories: graph-based learning and its adaptations, and fundamental research in big data. The first three articles consider temporal graphs and their adaptations, whereas the final three address more diverse topics within big data research.

Pattern detection in vertex-colored temporal graphs is an important problem in mining temporal data, with applications such as recommending tours to tourists and detecting abnormal behavior in a network of financial transactions. The article titled “Finding Path Motifs in Large Temporal Graphs Using Algebraic Fingerprints” introduces an algebraic–algorithmic framework based on constrained multilinear sieving to address the problem. The proposed solution scales to massive graphs with up to a billion edges, and the implementation is publicly available for reproducing the results.

Temporal graph data are now ubiquitous, for example, in social networks. Modeling the dissemination of information in these graphs, for instance the spread of fake news, is a significant challenge. Standard supervised learning methods are not sufficient on such evolving graphs. To this end, the article titled “Classifying Dissemination Processes in Temporal Graphs” lifts graph neural networks to the temporal domain. It demonstrates that temporal information is crucial for modeling dissemination, and the results show that the proposed approach scales to large graphs.

Incorporating domain knowledge into learning has long been a goal of AI and ML. The article titled “Classifying Dissemination Processes in Temporal Graphs” addresses this goal by incorporating a medical knowledge graph as internal information about a patient in disease modeling. It employs a graph neural network in which information from electronic health records (EHRs) is fused with domain knowledge in the form of domain knowledge graphs. The resulting spatiotemporal model is evaluated on two real clinical data sets: the publicly available MIMIC II data set and a private EHR data set. The results demonstrate the benefit of incorporating domain knowledge when learning a robust model for clinical diagnosis.

Time-series classification differs significantly from traditional classification because of the variations in time shapes and attribute orderings. Consequently, shapelets have become an effective tool for modeling such data. The article titled “LTSpAUC: Learning Time-Series Shapelets for Partial AUC Maximization” proposes a joint approach for learning shapelets along with a classifier that optimizes the area under the ROC curve (AUC). The resulting framework is shown to reduce the algorithmic complexity to linear in the time-series length. The article also demonstrates empirically that the proposed approach performs better on naturally occurring imbalanced time-series data sets.

Tensor decomposition is an effective tool in multiaspect data analysis and has a large number of practical applications. However, selecting the number of latent factors remains a challenging problem. Existing solutions are mostly based on heuristics or rely heavily on domain expertise. The article titled “NSVD: Normalized Singular Value Deviation Reveals Number of Latent Factors in Tensor Decomposition” introduces a novel method, normalized singular value deviation (NSVD), to address the problem on principled theoretical foundations. It is based on the variance of the singular values of the Khatri–Rao product formed by the PARAFAC factor matrices. In addition, an efficient compression scheme is developed so that NSVD can be applied to very large tensors.
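To make the ingredients named above concrete, the following minimal sketch computes the Khatri–Rao (column-wise Kronecker) product of a set of factor matrices and the variance of its singular values. It is not the article's NSVD criterion: the function names `khatri_rao` and `singular_value_variance` are hypothetical, and the column normalization applied before the product is an assumption made here for illustration only.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R), shape (I*J x R)."""
    if a.shape[1] != b.shape[1]:
        raise ValueError("factor matrices must have the same number of columns")
    # Row i*J + j of the result holds the elementwise product a[i, :] * b[j, :].
    return np.einsum("ir,jr->ijr", a, b).reshape(-1, a.shape[1])


def singular_value_variance(factors):
    """Variance of the singular values of the Khatri-Rao product of the factors.

    Assumption (not taken from the article): each factor's columns are scaled
    to unit Euclidean norm first, so columns must be nonzero.
    """
    product = None
    for f in factors:
        f = np.asarray(f, dtype=float)
        f = f / np.linalg.norm(f, axis=0, keepdims=True)
        product = f if product is None else khatri_rao(product, f)
    singular_values = np.linalg.svd(product, compute_uv=False)
    return float(np.var(singular_values))
```

Under these assumptions, factors with mutually orthogonal columns give singular values that are all equal, so the variance is zero, whereas duplicated (redundant) columns push some singular values toward zero and raise the variance.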

Along the lines of incorporating domain knowledge, the article titled “Physics-Guided Deep Learning for Drag Force Prediction in Dense Fluid-Particulate Systems” uses rich physical knowledge from the domain to accelerate learning. Specifically, it develops a deep learning model that incorporates physics-guided structural priors and physics-guided aggregate supervision to model the drag forces acting on each particle in a computational fluid dynamics–discrete element method. The evaluation on drag force prediction demonstrates the effectiveness of the proposed approach.

Collectively, the six articles cover different aspects of the best of SDM 2020. They address fundamental problems in machine learning and data mining while adapting them to real and relevant data sets.
