How to Use Stratified Sampling

Note: A stratified sample is not always necessary. If you specify Maximum Average Accuracy for a classification or regression model, you may not need to stratify the build data set. For information about accuracy, see Accuracy Type.

If the distribution of target values is heavily skewed, it may be necessary to create a build data set with an artificially balanced distribution. For example, in fraud detection or response to a marketing campaign, the target value may be positive 1% of the time or less. Most data mining algorithms need more than 1% positive examples to learn the factors that differentiate positive from negative target values. Therefore, you sample from the source data in a way that captures an artificially large share of positive values along with negative values, so that the model is built with a well-defined profile of individuals with positive target values.

Note: Even if the build data is a stratified sample, the test data must have the natural distribution; that is, do not stratify the test data set.

If you need to create a stratified build data set, proceed as follows:

  1. Split the data into build and test data sets using the Split transformation.
  2. Create a stratified sample of the build data set for input to the build process.
  3. Create a Build mining activity that uses the stratified build data set as input; do not perform the Split step or the Test step in this activity.
  4. Create a Test mining activity that uses the non-stratified test data set to test the model.

Note: The Stratified Sample wizard does not do over-sampling.
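
As the procedure below implies, the sample is balanced by keeping fewer rows of the more frequent target value, not by duplicating rows of the rare one. The following query is only an illustrative sketch of that under-sampling idea in Oracle SQL; it is not the wizard's actual implementation. It assumes the example table MYBUILDDATA and target column TARGET used in the steps below. The helper columns RN, CLASS_SIZE, and SMALLEST_CLASS are hypothetical names introduced for this sketch, and they appear in the query output.

```sql
-- Sketch only: keep a random subset of each target value, sized to the
-- smallest class, so that every target value ends up with the same count.
SELECT *
FROM (
  SELECT t.*,
         -- Size of the rarest target value across the whole table.
         MIN(CLASS_SIZE) OVER () AS SMALLEST_CLASS
  FROM (
    SELECT b.*,
           -- Random ordering of the rows within each target value.
           ROW_NUMBER() OVER (PARTITION BY TARGET
                              ORDER BY DBMS_RANDOM.VALUE) AS RN,
           -- Number of rows that share this row's target value.
           COUNT(*) OVER (PARTITION BY TARGET) AS CLASS_SIZE
    FROM MYBUILDDATA b
  ) t
)
WHERE RN <= SMALLEST_CLASS;
```

No row is returned more than once, which is what distinguishes under-sampling from over-sampling.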

Follow these steps to build a stratified sample to use as the build data set:

  1. Use Data | Transform | Split to create build and test data sets. Suppose that the build data set is named MYBUILDDATA.
  2. Determine the distribution of values for the target in the build data set. You may be able to use Show Summary Single-Record or Show Summary Multi-Record to do this, if the sample size used to create the histogram of the target is the same as the number of records in MYBUILDDATA.

    You can always use Tools | SQL Worksheet to find this information. For example, if the data set is named MYBUILDDATA and the target is named TARGET, the following query returns the number of cases where TARGET=1:

    SELECT COUNT(*) FROM MYBUILDDATA WHERE TARGET=1;

    Run a similar query for the negative target value, and make a note of the numbers of positive and negative cases.

  3. Use Data | Transform | Stratified Sample to create the stratified sample. Select the build data set that you just created using Split. In Step 3 of the wizard, select the target attribute from the Attribute pulldown list.

    The goal is to create a sample with approximately equal numbers of positive and negative values for the target attribute. Click the radio button next to Sample Size and enter a value equal to twice the number of positive target values in the build data set. If you have 100 positive values, you want to create a sample with 100 positive and 100 negative values; that is, a Sample Size of 200. Click Next.

  4. In step 4 of Stratified Sample, click Equal Distribution to specify an equal distribution of values. Click Next and then Finish to create the table. This new table is the one that you will use to build the model.
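
The counts from step 2 and the Sample Size value from step 3 can also be obtained in the SQL Worksheet. The following is a minimal sketch, assuming the example table MYBUILDDATA and target column TARGET, with 1 as the positive target value; adjust the names and values for your data. The column aliases NUM_CASES and SAMPLE_SIZE are chosen here for illustration.

```sql
-- Distribution of the target: one row per target value with its case count.
SELECT TARGET, COUNT(*) AS NUM_CASES
FROM MYBUILDDATA
GROUP BY TARGET
ORDER BY TARGET;

-- Sample Size for an equal distribution: twice the number of positive cases.
-- Assumes TARGET=1 marks the positive cases.
SELECT 2 * COUNT(*) AS SAMPLE_SIZE
FROM MYBUILDDATA
WHERE TARGET = 1;
```

With 100 positive cases, the second query returns 200, which matches the Sample Size in the example in step 3.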

Once the table is created, you can use the appropriate version of Show Summary to display a histogram of the target attribute. Because of the sampling method, the counts of positive and negative values are approximately, but not exactly, equal.
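
You can also check the result in the SQL Worksheet. This sketch assumes that the stratified table was named MYBUILDDATA_STRAT (a hypothetical name; use the name that you entered in the wizard) and that the target column is TARGET.

```sql
-- Count and percentage of each target value in the stratified sample.
-- For an equal distribution of two values, each should be close to 50%.
SELECT TARGET,
       COUNT(*) AS NUM_CASES,
       ROUND(100 * RATIO_TO_REPORT(COUNT(*)) OVER (), 1) AS PCT_OF_SAMPLE
FROM MYBUILDDATA_STRAT
GROUP BY TARGET
ORDER BY TARGET;
```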

For an example of creating a stratified sample and using it to build a model, see the Oracle Data Mining Tutorial.