Title:
Data mining : practical machine learning tools and techniques with Java implementations
Author:
Witten, I. H. (Ian H.)
Publication Information:
San Francisco, Calif. : Morgan Kaufmann, [2000]

©2000
Physical Description:
xxv, 371 pages ; 24 cm
Language:
English
Added Author:
Frank, Eibe
ISBN:
9781558605527
Format:
Book

Call Number:
QA76.9.D343 W58 2000
Material Type:
Adult Non-Fiction
Home Location:
Central Closed Stacks

Summary

This book offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. Inside, you'll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining--including both tried-and-true techniques of the past and Java-based methods at the leading edge of contemporary research. If you're involved at any level in the work of extracting usable knowledge from large collections of data, this clearly written and effectively illustrated book will prove an invaluable resource.


Complementing the authors' instruction is a fully functional platform-independent Java software system for machine learning, available for download. Apply it to the sample data sets provided to refine your data mining skills, apply it to your own data to discern meaningful patterns and generate valuable insights, adapt it for your specialized data mining applications, or use it to develop your own machine learning schemes.
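
As a rough illustration of that workflow, the sketch below loads one of the sample ARFF datasets and builds a decision tree with the downloadable Java (Weka) library described in Chapter 8. The package path weka.classifiers.trees.J48 and the data/weather.arff location are assumptions that vary between releases (older distributions shipped the class as weka.classifiers.j48.J48).

    // A minimal sketch, assuming a Weka 3 release: load a sample dataset,
    // build a C4.5-style decision tree (J48), and print the learned model.
    // Class package and file path are assumptions; adjust for your release.
    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class WeatherDemo {
        public static void main(String[] args) throws Exception {
            // Read the ARFF file into Weka's in-memory dataset representation.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("data/weather.arff")));
            // By convention the class attribute is the last one in the file.
            data.setClassIndex(data.numAttributes() - 1);

            // Train the decision tree learner on the dataset and print it.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }

The command-line interface covered in section 8.3 offers the same functionality without writing any code.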


Author Notes

Ian H. Witten is professor of computer science at the University of Waikato in New Zealand. He is a fellow of the ACM and the Royal Society of New Zealand and a member of professional computing, information retrieval, and engineering associations in the U.K., U.S., Canada, and New Zealand.
Eibe Frank is a researcher in the Machine Learning group at the University of Waikato. He holds a degree in computer science from the University of Karlsruhe in Germany and has published several papers at machine learning conferences and in machine learning journals.


Table of Contents

Foreword p. v
Preface p. xix
1 What's it all about? p. 1
1.1 Data mining and machine learning p. 2
Describing structural patterns p. 4
Machine learning p. 5
Data mining p. 7
1.2 Simple examples: The weather problem and others p. 8
The weather problem p. 8
Contact lenses: An idealized problem p. 11
Irises: A classic numeric dataset p. 13
CPU performance: Introducing numeric prediction p. 15
Labor negotiations: A more realistic example p. 16
Soybean classification: A classic machine learning success p. 17
1.3 Fielded applications p. 20
Decisions involving judgment p. 21
Screening images p. 22
Load forecasting p. 23
Diagnosis p. 24
Marketing and sales p. 25
1.4 Machine learning and statistics p. 26
1.5 Generalization as search p. 27
Enumerating the concept space p. 28
Bias p. 29
1.6 Data mining and ethics p. 32
1.7 Further reading p. 34
2 Input: Concepts, instances, attributes p. 37
2.1 What's a concept? p. 38
2.2 What's in an example? p. 41
2.3 What's in an attribute? p. 45
2.4 Preparing the input p. 48
Gathering the data together p. 48
Arff format p. 49
Attribute types p. 51
Missing values p. 52
Inaccurate values p. 53
Getting to know your data p. 54
2.5 Further reading p. 55
3 Output: Knowledge representation p. 57
3.1 Decision tables p. 58
3.2 Decision trees p. 58
3.3 Classification rules p. 59
3.4 Association rules p. 63
3.5 Rules with exceptions p. 64
3.6 Rules involving relations p. 67
3.7 Trees for numeric prediction p. 70
3.8 Instance-based representation p. 72
3.9 Clusters p. 75
3.10 Further reading p. 76
4 Algorithms: The basic methods p. 77
4.1 Inferring rudimentary rules p. 78
Missing values and numeric attributes p. 80
Discussion p. 81
4.2 Statistical modeling p. 82
Missing values and numeric attributes p. 85
Discussion p. 88
4.3 Divide and conquer: Constructing decision trees p. 89
Calculating information p. 93
Highly branching attributes p. 94
Discussion p. 97
4.4 Covering algorithms: Constructing rules p. 97
Rules versus trees p. 98
A simple covering algorithm p. 98
Rules versus decision lists p. 103
4.5 Mining association rules p. 104
Item sets p. 105
Association rules p. 105
Generating rules efficiently p. 108
Discussion p. 111
4.6 Linear models p. 112
Numeric prediction p. 112
Classification p. 113
Discussion p. 113
4.7 Instance-based learning p. 114
The distance function p. 114
Discussion p. 115
4.8 Further reading p. 116
5 Credibility: Evaluating what's been learned p. 119
5.1 Training and testing p. 120
5.2 Predicting performance p. 123
5.3 Cross-validation p. 125
5.4 Other estimates p. 127
Leave-one-out p. 127
The bootstrap p. 128
5.5 Comparing data mining schemes p. 129
5.6 Predicting probabilities p. 133
Quadratic loss function p. 134
Informational loss function p. 135
Discussion p. 136
5.7 Counting the cost p. 137
Lift charts p. 139
ROC curves p. 141
Cost-sensitive learning p. 144
Discussion p. 145
5.8 Evaluating numeric prediction p. 147
5.9 The minimum description length principle p. 150
5.10 Applying MDL to clustering p. 154
5.11 Further reading p. 155
6 Implementations: Real machine learning schemes p. 157
6.1 Decision trees p. 159
Numeric attributes p. 159
Missing values p. 161
Pruning p. 162
Estimating error rates p. 164
Complexity of decision tree induction p. 167
From trees to rules p. 168
C4.5: Choices and options p. 169
Discussion p. 169
6.2 Classification rules p. 170
Criteria for choosing tests p. 171
Missing values, numeric attributes p. 172
Good rules and bad rules p. 173
Generating good rules p. 174
Generating good decision lists p. 175
Probability measure for rule evaluation p. 177
Evaluating rules using a test set p. 178
Obtaining rules from partial decision trees p. 181
Rules with exceptions p. 184
Discussion p. 187
6.3 Extending linear classification: Support vector machines p. 188
The maximum margin hyperplane p. 189
Nonlinear class boundaries p. 191
Discussion p. 193
6.4 Instance-based learning p. 193
Reducing the number of exemplars p. 194
Pruning noisy exemplars p. 194
Weighting attributes p. 195
Generalizing exemplars p. 196
Distance functions for generalized exemplars p. 197
Generalized distance functions p. 199
Discussion p. 200
6.5 Numeric prediction p. 201
Model trees p. 202
Building the tree p. 202
Pruning the tree p. 203
Nominal attributes p. 204
Missing values p. 204
Pseudo-code for model tree induction p. 205
Locally weighted linear regression p. 208
Discussion p. 209
6.6 Clustering p. 210
Iterative distance-based clustering p. 211
Incremental clustering p. 212
Category utility p. 217
Probability-based clustering p. 218
The EM algorithm p. 221
Extending the mixture model p. 223
Bayesian clustering p. 225
Discussion p. 226
7 Moving on: Engineering the input and output p. 229
7.1 Attribute selection p. 232
Scheme-independent selection p. 233
Searching the attribute space p. 235
Scheme-specific selection p. 236
7.2 Discretizing numeric attributes p. 238
Unsupervised discretization p. 239
Entropy-based discretization p. 240
Other discretization methods p. 243
Entropy-based versus error-based discretization p. 244
Converting discrete to numeric attributes p. 246
7.3 Automatic data cleansing p. 247
Improving decision trees p. 247
Robust regression p. 248
Detecting anomalies p. 249
7.4 Combining multiple models p. 250
Bagging p. 251
Boosting p. 254
Stacking p. 258
Error-correcting output codes p. 260
7.5 Further reading p. 263
8 Nuts and bolts: Machine learning algorithms in Java p. 265
8.1 Getting started p. 267
8.2 Javadoc and the class library p. 271
Classes, instances, and packages p. 272
The weka.core package p. 272
The weka.classifiers package p. 274
Other packages p. 276
Indexes p. 277
8.3 Processing datasets using the machine learning programs p. 277
Using M5' p. 277
Generic options p. 279
Scheme-specific options p. 282
Classifiers p. 283
Meta-learning schemes p. 286
Filters p. 289
Association rules p. 294
Clustering p. 296
8.4 Embedded machine learning p. 297
A simple message classifier p. 299
8.5 Writing new learning schemes p. 306
An example classifier p. 306
Conventions for implementing classifiers p. 314
Writing filters p. 314
An example filter p. 316
Conventions for writing filters p. 317
9 Looking forward p. 321
9.1 Learning from massive datasets p. 322
9.2 Visualizing machine learning p. 325
Visualizing the input p. 325
Visualizing the output p. 327
9.3 Incorporating domain knowledge p. 329
9.4 Text mining p. 331
Finding key phrases for documents p. 331
Finding information in running text p. 333
Soft parsing p. 334
9.5 Mining the World Wide Web p. 335
9.6 Further reading p. 336
References p. 339
Index p. 351
About the authors p. 371