Introduction to Data Mining


Chet Langin


Introduction to Data Mining
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
Addison Wesley

This textbook is an advanced introduction and reference to data mining from an algorithmic perspective. It is appropriate for graduate students and researchers. An instructor should select parts of the book for a curriculum rather than proceeding straight through from the beginning. The authors suggest even-numbered chapters for classroom instruction, and most of these chapters do not have to be covered in order. (The chapter on anomaly detection, however, should be studied last, because it relies on information provided in earlier chapters.)

Explanations of some basic procedures, such as C4.5, CART, and PageRank, are left out. However, other more complicated procedures, such as SVM, are explained in some detail, making this book more appropriate for advanced courses than for introductory ones.

The first three chapters introduce the reader to data mining, data, and general exploration of data. Data mining is introduced in Chapter 1 in terms of describing data and making predictions from it. Chapter 2 discusses types and quality of data, preprocessing data, and measures of similarity and dissimilarity. Summary statistics, visualization, and On-Line Analytical Processing (OLAP) are the major themes of Chapter 3, which covers data exploration.

The bulk of the book has two chapters each on classification, association, and clustering, which do not have to be read in order.

Basic concepts of classification are presented in Chapter 4 and include decision tree induction, overfitting, evaluating the performance of a classifier, and methods of comparing classifiers. Chapter 5 discusses alternative classification techniques, including rule-based classifiers, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks (ANN), support vector machines (SVM), and ensemble methods. This chapter also covers the issues of class imbalance and multiclass problems.

Association analysis begins in Chapter 6 and concentrates on the Apriori principle to generate frequent itemsets and rules. Other topics in this chapter are compact representation of frequent itemsets, alternative methods for generating frequent itemsets, the FP-Growth algorithm, evaluation of association patterns, and the effect of skewed support distribution. Chapter 7 continues with association analysis by considering categorical attributes, continuous attributes, a concept hierarchy, sequential patterns, subgraph patterns, and infrequent patterns.
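
For readers who have not met the Apriori principle, the idea is that an itemset can be frequent only if every one of its subsets is frequent, so candidates with an infrequent subset can be discarded before they are counted. The following is a minimal sketch written for this review, not code from the book. The function name frequent_itemsets and the min_support parameter (an absolute transaction count) are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset that appears in at
    least min_support transactions (hypothetical helper, illustration only)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step (the Apriori principle): keep a candidate only if
            # every one of its (k-1)-item subsets is itself frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1

    return result
```

Calling frequent_itemsets on a list of sets of item names, with min_support given as a count, returns a dictionary keyed by frozenset. The sketch recounts support by scanning every transaction at each level, which is simple but slow; the book discusses more efficient alternatives such as FP-Growth.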

Cluster analysis begins with basic concepts in Chapter 8 and continues with additional issues and algorithms in Chapter 9. Numerous clustering techniques are discussed, including K-means, agglomerative, DBSCAN, fuzzy, Expectation-Maximization (EM), Self-Organizing Maps (SOM), CLIQUE, DENCLUE, sparsification, Minimum Spanning Tree (MST), OPOSSUM, Chameleon, shared nearest neighbor similarity, Jarvis-Patrick, SNN density, BIRCH, and CURE. These chapters also discuss evaluating clusters and clustering algorithms. The authors note in Chapter 9 that a firm understanding of statistics and probability is required for some of these methods.

The last chapter covers anomaly detection and presumes that the reader is familiar with some of the concepts covered in previous chapters. Statistical, proximity-based, density-based, and clustering techniques are discussed.

Appendices include background information on linear algebra, dimensionality reduction, probability, statistics, regression, and optimization.

The outline of the book is not always parallel in the way that subsections are organized, which can confuse the reader who is attempting to understand the context of what is being presented. For example, Chapter 5 has more than one subsection on types of classifiers, then a subsection each on a specific type of classifier, ensembles of classifiers, and a general problem having to do with classifying. A better outline would have made it clearer when the text was presenting types of classifiers, when it was explaining specific examples of classifying algorithms, and when it was discussing meta information about classification. An introductory text should lead the student more gently into the maze of data mining concepts and methods.

This book, in summary, is a good reference that provides deeper information about data mining methods than can be easily found elsewhere. It is also appropriate as a supplemental text or for an advanced introductory course.



Book review