Book Review: Review of SAS Enterprise Miner Textbooks

Abstract

SAS Enterprise Miner (EM) provides a suite of tools for predictive modeling. Working in a visual process flow, users select nodes for statistical analyses ranging from regression to decision trees and neural networks. Combined with additional nodes for gradient boosting, text mining, and open source connectivity with R, EM offers significant scalability for analytics projects. Accordingly, texts on EM vary from a generalist approach to specialized treatment of a particular concept or program node. Lucid exposition of theoretical models, diagrammatic outlines of program nodes, and guidance for model selection are employed to appeal to a diverse target audience. Concerns regarding methodological rigor are not only a function of this expository approach in each related SAS Institute publication but also an indication of the limitations inherent in EM.

Elementary to intermediate users of SAS Enterprise Miner

Serving as an introduction to the EM interface as well as decision trees, neural networks, and regression, the Sarma text covers an assortment of topics in fewer than 500 pages. Following a brief chapter concerning research strategy, two detailed chapters are devoted to elementary EM nodes for data preprocessing and the project process flow in the graphical user interface. Concise theoretical discussion of variable clustering, factor analysis, and variable selection is presented in conjunction with simulated data exercises. Numerous screenshots are provided, depicting the varied EM options available to accommodate research cases with different levels of data. However, additional guidance on how to interpret output tables and statistics would prove useful to less experienced data analysts.

Decision trees, neural networks, and regression each receive individual chapters in the Sarma text. A comparative chapter combining these approaches proves critical to gaining greater insight into predictive models. A uniform pedagogical approach is employed, starting with theory and the underlying statistical equations, moving to a general example, and then guiding the reader through an EM exercise with downloadable data. Discussion of decision trees and neural networks is particularly strong, covering the fundamentals of each analytic technique requisite for further study of these topics. However, the chapter on regression may benefit from additional discussion of econometric assumptions and other model classes outside the scope of EM.

Comparing the output and predictive strength of each model is exemplified through a simplified risk model. Through the EM ensemble node, a predictive model combining regression, decision trees, and neural networks illustrates the ease of automating advanced research designs. Integral to Sarma’s guidance on combined models is a thorough discussion of gradient boosting and the underlying algorithms for the ensemble node. The author provides a cogent, well-reasoned exposition of each method.

Intermediate users of SAS and SAS Enterprise Miner

Authors de Ville and Neville provide a comprehensive discussion of decision trees within Enterprise Miner (EM). The first half of the text is highly readable and provides a logical introduction to decision trees, beginning with a review of statistical relationships and research context germane to valid inference. Historical developments, from automatic interaction detection (AID), chi-squared AID (CHAID), and XAID to classification and regression trees (CRT), are presented with a balanced evaluation of the strengths and weaknesses of each model. Included citations ensure proper documentation and seek to stimulate further research into decision trees. Relevant diagrams and graphical depictions also serve to reinforce this conceptual introduction and are amenable to both introductory and intermediate decision tree modeling.

Central to this text is a highly detailed, multistep approach to building decision trees within EM. Six steps, from preprocessing data and selection of model parameters to the process of growing and pruning a tree, are discussed at the introductory to intermediate level of statistical rigor. This section is clearly designed to ease the reader from merely manipulating the decision tree node in EM to understanding the underlying mechanics of this methodology.

The latter half of the text covers topics such as multidimensional business intelligence cubes and theoretical concerns related to decision trees. Emphasis on business applications develops the relevance of the text to applied modeling, alongside a comparison of regression, decision trees, and neural networks. This ensures that decision trees are not discussed as a singular methodology but rather as a complementary data-mining tool within EM. Thus, de Ville and Neville write a strong reference text for decision trees, which could only benefit from additional guided exercises.

Elementary to intermediate users of SAS and SAS Enterprise Miner

The Collica text provides a thorough overview of business concepts related to customer segmentation and clustering. The text is designed to guide the reader effortlessly from elementary descriptive statistics and tabular models to applied customer profiling via applications of processes such as Statistical Analysis System (SAS) distance calculations and the k-means algorithm. What proves unique to this text is the integration of the SAS code program node, which allows Collica to show the potential of Enterprise Miner when combined with statistical programming. Further, a process flow table is included for each example, effectively outlining each step through a particular analytic technique.
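For readers unfamiliar with the technique, the combination of distance calculations and k-means mentioned above can be sketched in a few lines. The example below is a minimal illustration in Python, not the SAS procedures or EM nodes used in the Collica text; the function names, the two made-up customer attributes, and the choice of starting centers are all assumptions made for illustration.

```python
import math

def nearest(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(points, centers, max_iter=100):
    """Plain k-means: alternate assignment and center updates until stable."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # New center is the attribute-wise mean of the cluster members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old)
        if new_centers == centers:
            break
        centers = new_centers
    # Final assignment against the centers that were settled on.
    return [nearest(p, centers) for p in points], centers

# Hypothetical customers: (annual spend in thousands, visits per month).
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]
labels, centers = k_means(points, [points[0], points[3]])
print(labels)
print(centers)
```

In practice the attributes would usually be standardized first, since Euclidean distance is sensitive to the scale of each variable, and the starting centers would be chosen less naively than here.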

The structure of the text provides a strong basis for learning not only Enterprise Miner program nodes but also how to approach increasingly advanced customer segmentation and clustering research projects. Part 2 begins with an elementary cell-based approach to segmentation that quickly evolves into a project on over 100,000 data points across many attributes. Collica continues this approach with increasingly advanced concepts, all while maintaining a technical yet unambiguous expository style.

Enterprise Miner Review

Enterprise Miner (EM) is a powerful program that will appeal to a range of users. With a variety of program nodes for predictive modeling, this program can significantly automate large data analyses. However, learning the project process flow, program nodes, and associated model parameters can become a time-consuming process, wherein the user learns EM without a comprehensive understanding of model assumptions and the resulting predictive modeling output. For inexperienced researchers, the ease of selecting parameters in EM may mask a limited understanding of statistical methodology and econometrics.

A discussion from the Sarma text highlights this phenomenon. In variable selection, the user can choose an R-square criterion; this option can drive researchers or analysts to seek values most agreeable to their research findings rather than to understand the causal structure of their respective model. Hence, while EM serves a diverse audience of analytics researchers and practitioners, it may prove most valuable to experienced analytics practitioners seeking expedited analyses.
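The sensitivity described above can be illustrated with a short sketch. The Python example below is a simplified stand-in, not EM's actual variable selection algorithm: it scores each candidate input by its single-input R-square with the target and keeps those at or above a user-chosen minimum. The function names, the threshold parameter, and the data are hypothetical.

```python
from statistics import mean

def r_square(x, y):
    """R-square of a one-input linear fit (the squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # A constant series explains nothing.
    return sxy ** 2 / (sxx * syy)

def select_inputs(inputs, target, min_r_square):
    """Keep the inputs whose R-square with the target meets the threshold."""
    return sorted(
        name for name, values in inputs.items()
        if r_square(values, target) >= min_r_square
    )

# Made-up data: one strong, one moderate, and one weak candidate input.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
inputs = {
    "income": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "tenure": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}

# The selected set depends entirely on the threshold the analyst picks.
for threshold in (0.9, 0.5, 0.01):
    print(threshold, select_inputs(inputs, target, threshold))
```

Lowering the threshold admits the moderate input and then the weak one, so the same data can yield three different sets of inputs. Nothing in the calculation indicates which threshold is appropriate; that judgment rests with the analyst.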

However, as noted in each text, EM provides significant opportunities for combined analytic techniques and applications to myriad research projects and industries. This is particularly evident when the SAS Code program node is employed; when effectively utilized, applications in EM such as customer segmentation research evolve from mere rote methodological calculations to a statistical environment subject to the requirements of the researcher.

Questions of predictive validity and model selection are best addressed by advanced program users. Researchers and analysts who possess the technical skill and ability to engage model parameters, code, and integrated models will produce models that possess a higher probability of reaching informed predictive decision making through EM. This reflects not exclusively on the logical process flow structure of EM but rather on how training and associated texts may cover predictive modeling projects.