Data is said to be sparse if only a small fraction (no more than 20%, often 3% or less) of the attributes are non-zero or non-NULL for any given case. Sparse data occurs, for example, in market basket problems. A grocery store might carry 10,000 products, and the average size of a basket (the collection of distinct products that a customer purchases in a typical transaction) might be 50 distinct products. In this example, a typical transaction (case or record) has about 50 out of 10,000 attributes that are not NULL.
This implies that the fraction of non-zero attributes in the table (or the density) is 50/10,000, or 0.5%. This density is typical for market basket and text processing problems.
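The density arithmetic above can be checked with a short Python sketch. The function name `density` and the sample baskets are hypothetical, chosen only to mirror the 50-out-of-10,000 example; they are not part of Oracle Data Miner.

```python
def density(baskets, n_attributes):
    """Return the fraction of attribute slots that are populated.

    baskets      -- one collection of product IDs per transaction (case)
    n_attributes -- total number of possible attributes (products)
    """
    # Count distinct products per basket, then divide by the size of the
    # full case-by-attribute table.
    populated = sum(len(set(basket)) for basket in baskets)
    return populated / (len(baskets) * n_attributes)


# Hypothetical data: 3 transactions, each with 50 distinct products,
# in a store that carries 10,000 products.
baskets = [range(0, 50), range(100, 150), range(9000, 9050)]
d = density(baskets, 10_000)
print(f"{d:.2%}")  # 0.50%
```

A result well under the 20% threshold mentioned earlier marks the table as sparse.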
Association models are designed to process sparse data; indeed, if the data is not sparse, the algorithm may require a large amount of temporary space and may not be able to build a model.
Oracle Data Miner by default considers all transactional data to be sparse.
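To see why transactional data is naturally sparse, consider the following Python sketch. It is an illustration only, assuming a hypothetical four-product catalog and a list of (case ID, item) rows; it does not show how Oracle Data Miner stores or processes data internally.

```python
from collections import defaultdict

# Hypothetical transactional rows: one (case_id, item) pair per purchase.
rows = [(1, "milk"), (1, "bread"), (2, "eggs"), (3, "milk"), (3, "eggs")]
# Hypothetical product catalog (the full set of attributes).
catalog = ["bread", "eggs", "milk", "tea"]


def to_cases(rows):
    """Group transactional rows into one set of items per case."""
    cases = defaultdict(set)
    for case_id, item in rows:
        cases[case_id].add(item)
    return dict(cases)


def to_dense(items, catalog):
    """Expand one case to a full row: 1 if purchased, None (NULL) otherwise."""
    return [1 if product in items else None for product in catalog]


cases = to_cases(rows)
dense_case_1 = to_dense(cases[1], catalog)  # [1, None, 1, None]
```

The transactional form records only the items that are present, so every attribute it omits corresponds to a NULL in the expanded row.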
The following algorithms support sparse data:
Algorithms that support sparse data also support text mining.
Different algorithms make different assumptions about what indicates sparse data as follows:
For some algorithms, NULL values indicate sparse data. Missing values are not automatically handled. If the data is not sparse and the values are indeed missing at random, you must perform some kind of missing values treatment; otherwise, the algorithm will not handle the data correctly.
For the other algorithms, NULL values are treated as missing and not as indicators of sparse data.
For information about missing values, see Missing Values.
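The two interpretations of NULL described above can be illustrated with a minimal Python sketch. The helper names are hypothetical, and mean imputation is assumed here only as one possible missing values treatment; it is not necessarily the treatment that Oracle Data Miner applies.

```python
from statistics import fmean


def as_sparse(column):
    """Interpret None as 'attribute absent': replace it with 0."""
    return [0.0 if value is None else value for value in column]


def as_missing(column):
    """Interpret None as 'value unknown': replace it with the mean of
    the known values (an assumed treatment, for illustration only)."""
    known = [value for value in column if value is not None]
    fill = fmean(known)
    return [fill if value is None else value for value in column]


# Hypothetical numeric attribute with two NULLs.
column = [4.0, None, 2.0, None]
sparse_view = as_sparse(column)    # [4.0, 0.0, 2.0, 0.0]
missing_view = as_missing(column)  # [4.0, 3.0, 2.0, 3.0]
```

The same input yields different values under each interpretation, which is why data that is not sparse needs explicit treatment before it is passed to an algorithm that reads NULL as a sparsity indicator.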
Copyright © 2006, 2008, Oracle. All rights reserved.