Note: It is not always necessary to create a stratified sample. If you specify Maximum Average Accuracy for a classification or regression model, it may not be necessary to stratify the build data set. For information about accuracy, see Accuracy Type.
If the distribution of target values is heavily skewed, it may be necessary to create a build data set with an artificially balanced distribution. For example, in fraud detection or response to a marketing campaign, the positive target value may occur 1% of the time or less. Most data mining algorithms need more than 1% positive examples to learn the factors that differentiate positive from negative target values. Therefore, you must sample the source data so that the build data set contains an artificially large proportion of positive values alongside the negative values; the model is then created with a well-defined profile of individuals with positive target values.
Note: Even if the build data is a stratified sample, the test data must keep the natural distribution; that is, the test data set must not be stratified.
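The following Python sketch illustrates the idea outside Oracle Data Miner: set aside test data first, then balance only the build portion by under-sampling the negative cases. The function name, the 30% test fraction, and the synthetic rows are assumptions made for illustration; they are not part of the product.

```python
import random


def balanced_build_and_natural_test(rows, target_key="TARGET",
                                    test_fraction=0.3, seed=0):
    """Split rows into build and test sets, then balance only the build set.

    Hypothetical helper for illustration; not an Oracle Data Miner API.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]      # left untouched: natural distribution
    build = shuffled[n_test:]

    positives = [r for r in build if r[target_key] == 1]
    negatives = [r for r in build if r[target_key] == 0]

    # Under-sample the larger class; no row is ever duplicated.
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced, test


# Synthetic data with a 1% positive rate (assumed for the example).
rows = [{"ID": i, "TARGET": 1 if i % 100 == 0 else 0} for i in range(10_000)]
build, test = balanced_build_and_natural_test(rows)
```

In this sketch the build set ends up with equal numbers of positive and negative rows, while the test set keeps whatever positive rate the random split gives it, which should be close to the original 1%.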
If you need to create a stratified build data set, use the procedure described below.
Note: The Stratified Sample wizard does not do over-sampling.
Follow these steps to build a stratified sample to use as the build data set:
You can always use Tools | SQL Worksheet to find the numbers of positive and negative cases. For example, if the data set is named MYBUILDDATA and the target is named TARGET, the following query returns the number of cases where TARGET=1:
SELECT COUNT(*) FROM MYBUILDDATA WHERE TARGET=1;
Make a note of the numbers of positive and negative cases.
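Both counts can also be obtained in a single pass by grouping on the target. The sketch below runs such a query from Python against an in-memory SQLite table that stands in for MYBUILDDATA; the table contents are invented for the example, and SQLite is an assumption used only so that the code is self-contained. The GROUP BY query itself is standard SQL and should run unchanged in SQL Worksheet.

```python
import sqlite3

# In-memory stand-in for the real table (assumption: 10,000 rows, 1% positive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MYBUILDDATA (ID INTEGER PRIMARY KEY, TARGET INTEGER)")
conn.executemany(
    "INSERT INTO MYBUILDDATA (ID, TARGET) VALUES (?, ?)",
    [(i, 1 if i % 100 == 0 else 0) for i in range(1, 10_001)],
)

# One query returns a row per target value with its number of cases.
query = "SELECT TARGET, COUNT(*) FROM MYBUILDDATA GROUP BY TARGET"
counts = dict(conn.execute(query))

positive_count = counts.get(1, 0)
negative_count = counts.get(0, 0)
print(f"positive={positive_count} negative={negative_count}")
```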
The goal is to create a sample with approximately equal numbers of positive and negative values for the target attribute. Click the radio button next to Sample Size and enter a value equal to twice the number of positive target values in the build data set. If you have 100 positive values, you want to create a sample with 100 positive and 100 negative values; that is, a Sample Size of 200. Click Next.
Once the table is created, you can use the appropriate version of Show Summary to display a histogram of the target attribute. Because of the sampling method, the totals of positive and negative cases are approximately, not exactly, equal.
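The sample-size arithmetic and the check on the resulting target distribution can be sketched in Python as follows. This is an illustration of the idea only, not the algorithm used by the Stratified Sample wizard: the function name and the synthetic rows are assumptions, and unlike the wizard this sketch produces exactly equal class counts.

```python
import random
from collections import Counter


def stratified_equal_sample(rows, sample_size, target_key="TARGET", seed=0):
    """Draw about sample_size rows, split evenly across target values.

    Hypothetical helper for illustration; it never over-samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)

    per_class = sample_size // len(by_class)
    sample = []
    for members in by_class.values():
        # Cap at the class size so that no row is drawn twice.
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


# Synthetic build data: 100 positive cases out of 10,000 (assumed).
rows = [{"ID": i, "TARGET": 1 if i < 100 else 0} for i in range(10_000)]

positives = sum(1 for r in rows if r["TARGET"] == 1)
sample_size = 2 * positives   # 100 positives -> Sample Size of 200

sample = stratified_equal_sample(rows, sample_size)
histogram = Counter(r["TARGET"] for r in sample)
print(sample_size, dict(histogram))
```

Counting the target values in the sample plays the same role as the Show Summary histogram: it confirms that the two classes are roughly balanced before the model is built.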
For an example of creating a stratified sample and using it to build a model, see the Oracle Data Mining Tutorial.
Copyright © 2006, 2008, Oracle. All rights reserved.