Sparsity

Data is said to be sparse if only a small fraction (no more than 20%, often 3% or less) of the attributes are non-zero or non-NULL for any given case. Sparse data occurs, for example, in market basket problems. A grocery store might stock 10,000 products, while the average basket (the collection of distinct products that a customer purchases in a typical transaction) contains 50 distinct products. In this example, a typical transaction (case or record) has about 50 out of 10,000 attributes that are not NULL. The fraction of non-NULL attributes in the table, called the density, is therefore about 50/10,000, or 0.5%. This density is typical for market basket and text processing problems.
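
The density arithmetic above can be checked with a short sketch. This is only an illustration in Python, not part of Oracle Data Miner; the function names and the three basket sizes are hypothetical, and the 20% cutoff is taken from the definition above.

```python
def density(non_null_per_case, total_attributes):
    """Fraction of table cells that are non-NULL.

    non_null_per_case: number of non-NULL attributes in each case
    total_attributes:  number of attributes (columns) in the table
    """
    if total_attributes <= 0:
        raise ValueError("total_attributes must be positive")
    if not non_null_per_case:
        raise ValueError("need at least one case")
    filled = sum(non_null_per_case)
    cells = len(non_null_per_case) * total_attributes
    return filled / cells


def is_sparse(d, threshold=0.20):
    """Apply the 'no more than 20%' rule of thumb from the text."""
    return d <= threshold


# Three hypothetical baskets averaging 50 distinct products,
# drawn from a store that stocks 10,000 products.
basket_sizes = [40, 50, 60]
d = density(basket_sizes, 10_000)
print(f"density = {d:.2%}, sparse = {is_sparse(d)}")
```

With an average basket of 50 products, the computed density matches the 0.5% figure in the example.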

Association models are designed to process sparse data. If the data is not sparse, the association algorithm may require a large amount of temporary space and may not be able to build a model.

Oracle Data Miner by default considers all transactional data to be sparse.
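
To see why transactional data is naturally sparse, consider the sketch below. It assumes a hypothetical transactional layout of (case ID, item) rows in which only purchased items appear, and expands it into one dense row per case. The data, the catalog, and the helper names are invented for illustration; this is not how Oracle Data Miner stores or processes data internally.

```python
from collections import defaultdict

# Hypothetical transactional rows: one (case_id, item) pair per purchased item.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "milk"),
    (3, "bread"), (3, "eggs"), (3, "milk"),
]

# Hypothetical product catalog; each product is one attribute.
catalog = ["bread", "butter", "eggs", "milk"]


def to_baskets(rows):
    """Group transactional rows into one set of items per case."""
    baskets = defaultdict(set)
    for case_id, item in rows:
        baskets[case_id].add(item)
    return dict(baskets)


def to_dense(baskets, catalog):
    """Expand each basket into a 0/1 row with one column per catalog item."""
    return {
        case_id: [1 if item in items else 0 for item in catalog]
        for case_id, items in baskets.items()
    }


baskets = to_baskets(rows)
dense = to_dense(baskets, catalog)

stored = len(rows)                          # cells kept in transactional form
expanded = len(baskets) * len(catalog)      # cells needed in dense form
print(f"transactional rows: {stored}, dense cells: {expanded}")
```

The transactional form stores only the items that are present, so its size grows with the number of purchases rather than with the size of the catalog. With a catalog of 10,000 products the gap between the two forms becomes very large.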

The following algorithms support sparse data:

Algorithms that support sparse data also support text mining.

Different algorithms make different assumptions about what indicates sparse data as follows:

For information about missing values, see Missing Values.