Outliers in Oracle Data Mining
An outlier is a value that is far outside the normal range in a data set, typically a value that is several standard deviations from the mean. The presence of outliers can have a significant impact on Oracle Data Mining models.
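As a rough illustration of the "several standard deviations from the mean" definition, the following Python sketch flags such values. It is not part of Oracle Data Mining: the function name `flag_outliers`, the threshold of 3 standard deviations, and the sample income figures are all assumptions chosen for the example.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean.

    Hypothetical helper for illustration; the threshold of 3 is an assumption.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up incomes: eleven typical values and one extreme entry.
incomes = [42_000, 45_000, 47_000, 50_000, 52_000, 55_000,
           58_000, 61_000, 63_000, 66_000, 68_000, 5_000_000]

print(flag_outliers(incomes))
```

Note that a single extreme value also inflates the standard deviation itself, so on very small data sets a simple test like this can miss outliers.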
Outliers affect the different algorithms as follows:
- Attribute Importance, Naive Bayes: When equal-width binning is used, outliers cause most of the data to concentrate in a few bins (a single bin in extreme cases). As a result, the discriminating power of these algorithms may be significantly reduced. In the case of Adaptive Bayes Network (ABN), if all attributes have outliers, ABN may not even be able to build a tree beyond the first split.
- Association Models: When equal-width binning is used, outliers cause most of the data to concentrate in a few bins (a single bin in extreme cases). As a result, the ability of association rules (AR) to detect differences in numerical attributes may be significantly reduced. For example, a numerical attribute such as income may have all of its data in a single bin except for one entry (the outlier) that falls in a different bin. In that case, no rules will reflect different levels of income. Every rule containing income will reflect only the range of that single bin, which is essentially the income range of the whole population.
- O-Cluster: When equal-width binning is used, outliers cause most of the data to concentrate in a few bins (a single bin in extreme cases). As a result, the ability of O-Cluster to detect clusters may be significantly reduced. If all of the data is divided among a few bins, it may look as if there are no clusters, that is, as if the whole population falls in a single cluster.
- k-Means: When equal-width binning is used, outliers cause most of the data to concentrate in a few bins (a single bin in extreme cases). As a result, the ability of k-Means to create clusters that differ in content may be significantly reduced. If all of the data is divided among a few bins, the clusters may have very similar centroids, histograms, and rules.
- Support Vector Machine: When Min-Max normalization is used, outliers cause most of the data to concentrate in a small range. This makes learning harder and leads to longer training times.
- Non-Negative Matrix Factorization: When Min-Max normalization is used, outliers cause most of the data to concentrate in a small range. This generally results in poor matrix factorization. To improve the factorization, the error tolerance would need to be decreased, which in turn would lead to longer build times.
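The two effects described in the list above can be reproduced with a small Python sketch. This is not Oracle Data Mining's own binning or normalization code: the functions `equal_width_bins` and `min_max_normalize`, the choice of 10 bins, and the income figures are assumptions made for illustration only.

```python
from collections import Counter

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical
    # Clamp so that the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up incomes: eleven typical values followed by one outlier.
incomes = [42_000, 45_000, 47_000, 50_000, 52_000, 55_000,
           58_000, 61_000, 63_000, 66_000, 68_000, 5_000_000]

# Equal-width binning: the outlier stretches the range, so every
# typical value collapses into the first bin.
bin_counts = Counter(equal_width_bins(incomes, 10))
print("bins with outlier:   ", dict(bin_counts))

# Without the outlier, the same values spread across many bins.
clean_bins = equal_width_bins(incomes[:-1], 10)
print("bins without outlier:", dict(Counter(clean_bins)))

# Min-Max normalization: the typical values are squeezed into a
# tiny interval near 0 while the outlier sits alone at 1.
scaled = min_max_normalize(incomes)
typical_max = max(scaled[:-1])
print("largest typical value after scaling:", round(typical_max, 4))
```

With the outlier present, eleven of the twelve values share one bin and scale to less than 0.01, which is the loss of resolution that the binning-based and normalization-based algorithms above are sensitive to.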
Copyright © 2006, 2008, Oracle. All rights reserved.