Glossary

A

ABN

See adaptive bayes network. ABN is deprecated in Oracle Data Mining 11g.

active learning

A feature of the support vector machine algorithm that provides a way to deal with large build sets.

activity

See mining activity.

adaptive bayes network

An Oracle proprietary classification algorithm. ABN provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. It operates in three modes: pruned naive bayes, single feature, and multi-feature. Single feature mode produces rules.

ABN is deprecated in Oracle Data Mining 11g; Oracle Data Miner 11.1 does not support building ABN models.

aggregation

The process of consolidating data values into a smaller number of values. For example, sales data could be collected on a daily basis and then be totaled to the week level.

AI

See attribute importance.

algorithm

A sequence of steps for solving a problem. See data mining algorithm. Oracle Data Miner supports the following algorithms: ABN, MDL, apriori, decision tree, k-means, naive bayes, non-negative matrix factorization, O-cluster, and support vector machine.

algorithm settings

The settings that specify algorithm-specific behavior for model building. These settings are usually called build settings in Oracle Data Miner.

anomaly detection

The detection of outliers or atypical cases. Anomaly detection problems can be solved using the one-class support vector machine algorithm.

apply

The data mining operation that scores data, that is, uses the model with new data to predict results.

apriori

Uses frequent itemsets to calculate associations.

AR

See association rules.

association

A machine learning technique that identifies relationships among items.

association rules

A mining function that captures co-occurrence of items among transactions. A typical rule is an implication of the form A -> B, which means that the presence of item set A implies the presence of item set B with certain support and confidence. The support of the rule is the ratio of the number of transactions where the item sets A and B are present to the total number of transactions. The confidence of the rule is the ratio of the number of transactions where the item sets A and B are present to the number of transactions where item set A is present. ODM uses the Apriori algorithm for association models.
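
The support and confidence ratios defined above can be computed directly from a list of transactions. The following is a minimal sketch in plain Python, not the ODM Apriori implementation; the function name and the sample baskets are made up for illustration.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of A.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain every item of A and of B.
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = support_and_confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk occur together in 2 of 4 baskets (support 0.5), and bread occurs in 3 baskets, so the confidence of bread -> milk is 2/3.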

attribute

An attribute corresponds to a column in a database table. An attribute has a name and a datatype. Each attribute in a record holds an item of information. Attribute names are constant from record to record for tables that are not nested tables. Attributes are also called variables, features, data fields, or table columns. See also target.

attribute importance

A mining function providing a measure of the importance of an attribute in predicting a specified target. The measure of different attributes of a build data table enables users to select the attributes that are found to be most relevant to a mining model. A smaller set of attributes results in a faster model build; the resulting model could be more accurate. ODM uses the minimum description length to discover important attributes. Sometimes referred to as feature selection or key fields.

B

binning

See discretization.

build settings

In Oracle Data Miner, settings on the Build tab of Advanced Options that control model build.

C

case

All the data collected about a specific transaction or related set of values. A data set is a collection of cases. Cases are also called records or examples. In the simplest situation, a case corresponds to a row in a table.

categorical attribute

An attribute whose values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, etc., or ordered (ordinal) such as high, medium, or low temperatures.

category

In the Java interface, corresponds to a distinct value of a categorical attribute. Categories may have character or numeric values.

centroid

See cluster centroid.

classification

A mining function for predicting categorical target values for new records using a model built from records with known target values. ODM supports the following algorithms for classification: Naive Bayes, Adaptive Bayes Networks (ABN), Decision Tree, Support Vector Machines, and Generalized Linear Models (Logistic Regression). ABN is deprecated in Oracle Data Mining 11g.

clipping

See trimming.

cluster centroid

The vector that encodes, for each attribute, either the mean (if the attribute is numerical) or the mode (if the attribute is categorical) of the cases in the build data assigned to a cluster. A cluster centroid is often referred to as "the centroid."
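
As an illustration of the definition above, the sketch below computes a centroid for a small cluster whose cases are represented as Python dictionaries. The representation, the function name, and the sample attribute names are assumptions made for this example only.

```python
from statistics import fmean, mode


def cluster_centroid(cases, numerical, categorical):
    """Mean of each numerical attribute, mode of each categorical one."""
    centroid = {}
    for name in numerical:
        centroid[name] = fmean([case[name] for case in cases])
    for name in categorical:
        centroid[name] = mode([case[name] for case in cases])
    return centroid


# Hypothetical cases assigned to one cluster.
cluster = [
    {"age": 30, "state": "CA"},
    {"age": 40, "state": "NY"},
    {"age": 50, "state": "CA"},
]
centroid = cluster_centroid(cluster, numerical=["age"], categorical=["state"])
```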

clustering

A mining function for finding naturally occurring groupings in data. More precisely, given a set of data points, each having a set of attributes, and a similarity measure among them, clustering is the process of grouping the data points into different clusters such that data points in the same cluster are more similar to one another and data points in different clusters are less similar to one another. ODM supports two algorithms for clustering: k-means and orthogonal partitioning clustering.

confusion matrix

Measures the correctness of predictions made by a model from a test task. The row indexes of a confusion matrix correspond to actual values observed and provided in the test data. The column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing.

When the predicted value equals the actual value, the prediction is correct; these counts lie on the diagonal of the matrix. All other entries indicate errors.
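
A confusion matrix can be tallied from paired actual and predicted values. The sketch below follows the row and column convention described above (rows are actual values, columns are predicted values); the function names and the sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) with rows = actual, columns = predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def accuracy(matrix):
    """Fraction of records on the diagonal, that is, predicted correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical test data.
actual = ["yes", "yes", "no", "no", "no", "yes"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]
labels, matrix = confusion_matrix(actual, predicted)
```

For this sample, four of the six records fall on the diagonal, so the accuracy is 4/6.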

connection

In Oracle Data Miner, the information necessary to connect to a data mining server in an Oracle database. Oracle Data Miner requires a server connection to perform data mining tasks.

cost matrix

An n by n table that defines the cost associated with a prediction versus the actual value. A cost matrix is typically used in classification models, where n is the number of distinct values in the target, and the columns and rows are labeled with target values. The rows are the actual values; the columns are the predicted values.

counterexample

Negative instance of a target. Counterexamples are required for classification models, except for one-class support vector machines.

D

data mining

The process of discovering hidden, previously unknown, and usable information from a large amount of data. This information is represented in a compact form, often referred to as a model.

data mining algorithm

A specific technique or procedure for producing a data mining model. An algorithm uses a specific model representation and may support one or more mining functions. The algorithms in the ODM programming interfaces are naive bayes, adaptive bayes network, support vector machine, decision tree, and generalized linear models for classification; support vector machine and generalized linear models for regression; k-means and O-cluster for clustering; minimum description length for attribute importance; non-negative matrix factorization for feature extraction; apriori for associations, and one-class support vector machine for anomaly detection.

data mining server

The component of the Oracle database that implements the data mining engine and persistent metadata repository. You must connect to a data mining server before performing data mining tasks. You connect to a data mining server when you start Oracle Data Miner.

data set

In general, a collection of data. A data set is a collection of cases.

descriptive model

A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. See also mining model.

discretization

Discretization groups related values together under a single value (or bin). This reduces the number of distinct values in a column. Fewer bins result in models that build faster. Many ODM algorithms (NB, ABN, etc.) may benefit from input data that is discretized prior to model building, testing, computing lift, and applying (scoring). Different algorithms may require different types of binning. Oracle Data Mining includes transformations that perform top N frequency binning for categorical attributes and equi-width binning and quantile binning for numerical attributes.

distance-based (clustering algorithm)

Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used.

decision tree

A decision tree is a representation of a classification system or supervised model. The tree is structured as a sequence of questions; the answers to the questions trace a path down the tree to a leaf, which yields the prediction.

Decision trees are a way of representing a series of questions that lead to a class or value. The top node of a decision tree is called the root node; terminal nodes are called leaf nodes. Decision trees are grown through an iterative splitting of data into discrete groups, where the goal is to maximize the distance between groups at each split.

An important characteristic of the decision tree models is that they are transparent; that is, there are rules that explain the classification.

See also rule.

DMS

See data mining server.

DT

See decision tree.

E

equi-width binning

Equi-width binning determines bins for numerical attributes by dividing the range of values into a specified number of bins of equal size.
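
A minimal sketch of equi-width binning in plain Python follows. It is not the Oracle Data Mining transformation; the function name, the zero-based bin numbering, and the choice to place the maximum value in the last bin are assumptions made for illustration.

```python
def equi_width_bin(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in one bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 25, 30, 35, 40, 60]
bins = equi_width_bin(ages, 4)
```

With a range of 20 to 60 and 4 bins, each bin is 10 units wide. Because the values are unevenly spread, the bins hold different numbers of cases, which is the main contrast with quantile binning.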

explode

For a categorical attribute, replace a multi-value categorical column with several binary categorical columns. To explode the attribute, create a new binary column for each distinct value that the attribute takes on. In the new columns, 1 indicates that the value of the attribute takes on the value of the column; 0, that it does not. For example, suppose that a categorical attribute takes on the values {1, 2, 3}. To explode this attribute, create three new columns, col_1, col_2, and col_3. If the attribute takes on the value 1, the value in col_1 is 1; the values in the other two columns are 0.
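
The {1, 2, 3} example can be sketched as follows. The function name and the dictionary-of-columns representation are assumptions for illustration; the column naming follows the col_1, col_2, col_3 pattern used above.

```python
def explode(values, prefix="col"):
    """Return one binary column per distinct value of a categorical attribute."""
    distinct = sorted(set(values))
    return {
        f"{prefix}_{d}": [1 if v == d else 0 for v in values]
        for d in distinct
    }


# One attribute value per case.
exploded = explode([1, 3, 2, 1])
```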

F

feature

A combination of attributes in the data that is of special interest and that captures important characteristics of the data. See feature extraction.

See also network feature and text feature.

feature extraction

Creates a new set of features by decomposing the original data. Feature extraction lets you describe the data with a number of features that is usually far smaller than the number of original attributes. See also non-negative matrix factorization.

G

generalized linear models

A statistical technique for linear modeling. Generalized linear models (GLM) include and extend the class of simple linear models. Oracle Data Mining supports logistic regression for GLM classification and linear regression for GLM regression.

GLM

See generalized linear models.

J

JDM

See Java Data Mining.

Java Data Mining

A pure Java API that facilitates development of data mining-enabled applications. Java Data Mining (JDM) supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata supporting mining activities. JDM is described in the Java Community Process Specification JSR-73. The Java interface to Oracle Data Mining is a compliant subset of JDM.

K

k-means

A distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used. ODM provides an enhanced version of k-means.

KM

See k-means.

L

lift

A measure of how much better prediction results are using a model than could be obtained by chance. For example, suppose that 2% of the customers mailed a catalog make a purchase; suppose also that when you use a model to select catalog recipients, 10% make a purchase. Then the lift for the model is 10/2 or 5. Lift may also be used as a measure to compare different data mining models. Since lift is computed using a data table with actual outcomes, lift compares how well a model performs with respect to this data on predicted outcomes. Lift indicates how well the model improved the predictions over a random selection given actual results. Lift allows a user to infer how a model will perform on new data.
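
The catalog example reduces to a ratio of two response rates. The sketch below only illustrates that arithmetic; the function name and the counts are hypothetical, and ODM computes lift from test data rather than from two hand-entered rates.

```python
def lift(model_hits, model_total, baseline_hits, baseline_total):
    """Ratio of the response rate with the model to the baseline response rate."""
    model_rate = model_hits / model_total
    baseline_rate = baseline_hits / baseline_total
    return model_rate / baseline_rate


# 10% of model-selected recipients purchase, versus 2% of all recipients.
catalog_lift = lift(10, 100, 2, 100)
```

The result is approximately 5, matching the 10/2 calculation above.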

lineage

The sequence of transformations performed on a data set during the data preparation phase of the model build process.

M

MDL

See minimum description length.

min-max normalization

Normalize each attribute using the transformation x_new = (x_old - min) / (max - min).
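
A minimal sketch of this transformation for one attribute is shown below. The function name is hypothetical, and mapping a constant attribute to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale values to [0.0, 1.0] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: assumed to map to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([10, 20, 30, 50])
```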

minimum description length

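Given a sample of data and an effective enumeration of the appropriate alternative theories to explain the data, the best theory is the one that minimizes the sum of the length, in bits, of the description of the theory and the length, in bits, of the data when encoded with the help of the theory.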

This principle is used to select the attributes that most influence target value discrimination in attribute importance.

mining activity

In Oracle Data Miner, a mining activity is a step-by-step guide to model build, model test, or model apply. The steps indicate the order in which operations must be applied and appropriate defaults for the operations. There are three basic mining activities: build activity to build and test a model, test activity to test a classification or regression model, and apply activity to apply a model to new data (score new data using the model). Test and apply activities can be used with models built using either of the Oracle Data Mining programmatic interfaces.

mining function

A major subdomain of data mining that shares common high level characteristics. The ODM programming interfaces support the following mining functions: classification, regression, attribute importance, feature extraction, and clustering. In both programming interfaces, anomaly detection is supported as classification.

mining model

An important function of data mining is the production of a model. A model can be a supervised model or an unsupervised model. Technically, a mining model is the result of building a model from mining settings. The representation of the model is specific to the algorithm specified by the user or selected by the DMS. A model can be used for direct inspection, e.g., to examine the rules produced from an ABN model or association models, or to score data.

mining object

Mining tasks, models, settings, and their components.

mining result

The end product(s) of a mining task. For example, a build task produces a mining model; a test task produces a test result.

mining task

See task.

missing value

A data value that is missing because it was not measured (that is, has a null value), not answered, was unknown, or was lost. Data mining algorithms vary in the way they treat missing values. There are several typical ways to treat them: ignore them, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values.

model

See mining model.

multi-record case

Each case in the data is stored as multiple records in a table with columns sequenceID, attribute_name, and value. Also known as transactional format. Oracle Data Miner requires that data for association models be in transactional format. See also single-record case.

N

naive bayes

An algorithm for classification that is based on Bayes's theorem. Naive Bayes makes the assumption that each attribute is conditionally independent of the others: given a particular value of the target, the distribution of each predictor is independent of the other predictors.

NB

See naive bayes.

nested table

An unordered set of data elements, all of the same datatype. It has a single column, and the type of that column is a built-in type or an object type. If an object type, the table can also be viewed as a multicolumn table, with a column for each attribute of the object type. Nested tables can contain other nested tables. Oracle Data Miner does not permit the use of nested tables as the data for mining activities.

network feature

A network feature is a tree-like multi-attribute structure. From the standpoint of the network, features are conditionally independent components. Features contain at least one attribute (the root attribute). Network features are used in the Adaptive Bayes Network algorithm.

NMF

See non-negative matrix factorization.

non-negative matrix factorization

A feature extraction algorithm that decomposes multivariate data by creating a user-defined number of features, which results in a reduced representation of the original data.

normalization

Normalization consists of transforming numerical values into a specific range, such as [–1.0,1.0] or [0.0,1.0] such that x_new = (x_old-shift)/scale. Normalization applies only to numerical attributes. Oracle Data Mining provides transformations that perform min-max normalization, scale normalization, and z-score normalization.

numerical attribute

An attribute whose values are numbers. The numeric value can be either an integer or a real number. Numerical attribute values can be manipulated as continuous values. See also categorical attribute.

O

O-cluster

See orthogonal partitioning clustering.

OC

See orthogonal partitioning clustering.

one-class support vector machine

The version of the support vector machine model used to solve anomaly detection problems. Anomaly detection is a special kind of classification. Standard classification algorithms require the presence of both positive and negative examples (counterexamples) for a target class. One-class support vector machine classification requires only the presence of examples of a single target class. The model learns to discriminate between the known examples of the positive class and the unknown negative set of counterexamples. The goal is to estimate a function that will be positive if an example belongs to a set and negative or zero if the example belongs to the complement of the set.

orthogonal partitioning clustering

An Oracle proprietary clustering algorithm that creates a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters.

outlier

A data value that does not come from the typical population of data; in other words, extreme values. In a normal distribution, outliers are typically at least 3 standard deviations from the mean.

P

positive target value

In binary classification problems, you may designate one of the two classes (target values) as positive, the other as negative. When ODM computes a model's lift, it calculates the density of positive target values among a set of test instances for which the model predicts positive values with a given degree of confidence.

predictive model

A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules. A predictive model is a supervised model.

predictor

An attribute used as input to a supervised model or algorithm to build a model.

prepared data

Data that is suitable for model building using a specified algorithm. Data preparation often accounts for much of the time spent in a data mining project. ODM includes transformations to perform common data preparation functions (binning, normalization, etc.).

prior probabilities

The set of prior probabilities specifies the distribution of examples of the various classes in the original source data. Also referred to as priors, these could be different from the distribution observed in the data set provided for model build.

priors

See prior probabilities.

Q

quantile binning

A numerical attribute is divided into bins such that each bin contains approximately the same number of cases.
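
One simple way to approximate this is to rank the cases and cut the ranking into equal-sized groups. The sketch below takes that approach; the function name is hypothetical, and unlike a production implementation it may place tied values in different bins.

```python
def quantile_bin(values, n_bins):
    """Assign bin indexes 0..n_bins-1 so each bin holds about the same number of cases."""
    # Indexes of the cases, ordered by attribute value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


bins = quantile_bin([5, 1, 9, 3, 7, 11], 3)
```

With six cases and three bins, each bin receives exactly two cases, regardless of how the values are spread over the range.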

R

random sample

A sample in which every element of the data set has an equal chance of being selected.

recode

Literally "change or rearrange the code." Recoding can be useful in many instances in data mining. Here are some examples:

record

See case.

regression

A data mining function for predicting continuous target values for new records using a model built from records with known target values. ODM provides the Support Vector Machine and Generalized Linear Models (linear regression) algorithms for regression.

rule

An expression of the general form if X, then Y. An output of certain algorithms, such as clustering, association, decision tree, and ABN. The predicate X may be a compound predicate.

S

sample

See random sample.

scale normalization

Normalize each attribute using the transformation x_new = (x - 0) / max(abs(max), abs(min)).
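
A minimal sketch of this transformation is shown below. The function name is hypothetical, and mapping an all-zero attribute to 0.0 is an assumption made here to avoid division by zero.

```python
def scale_normalize(values):
    """Divide each value by the largest absolute value of the attribute."""
    scale = max(abs(max(values)), abs(min(values)))
    if scale == 0:
        # Every value is zero: assumed to stay 0.0.
        return [0.0 for _ in values]
    return [v / scale for v in values]


scaled = scale_normalize([-4, 2, 8])
```

The results always lie in [-1.0, 1.0], and zero stays at zero because the shift is 0.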

schema

Database schema, that is, a collection of database objects, including logical structures such as tables, views, sequences, stored procedures, synonyms, indexes, clusters, and database links.

score

Scoring data means applying a data mining model to data to generate predictions. See apply.

scoring engine

An instance of Oracle Data Mining that can be used to apply or score models, but cannot be used to build models.

settings

See algorithm settings and build settings.

single-record case

Each case in the data is stored as one record (row) in a table. Contrast with multi-record case.

sparse data

Data for which only a small fraction of the attributes are non-zero or non-null in any given case. Market basket data and text mining data are often sparse.

split

Divide a data set into several disjoint subsets. For example, in a classification problem, a data set is often divided into a build data set and a test data set.

stratified sample

Divide the data set into disjoint subsets (strata) and then take a random sample from each of the subsets. This technique is used when the distribution of target values is skewed greatly. For example, response to a marketing campaign may have a positive target value 1% of the time or less. A stratified sample provides the data mining algorithms with enough positive examples to learn the factors that differentiate positive from negative target values. See also random sample.

stratified split

Divide the data set into disjoint subsets while preserving the distribution of a selected attribute. For example, create a build table and a test table in such a way that the distribution of the target attribute for both tables is the same as the distribution of the target attribute in the original table. See also split.
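
The build and test example can be sketched by splitting each group of target values separately and in the same proportion. This is an illustrative approach, not the Oracle Data Miner implementation; the function name, the fixed random seed, and the sample rows are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction, seed=0):
    """Split rows into (build, test), preserving the distribution of key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    build, test = [], []
    for members in strata.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        build.extend(members[n_test:])
    return build, test


# Hypothetical data: (case id, target), with 10% positive targets.
rows = [(i, 1 if i < 10 else 0) for i in range(100)]
build, test = stratified_split(rows, key=lambda r: r[1], test_fraction=0.3)
```

Both resulting tables keep the original 10% share of positive targets: 7 of 70 build rows and 3 of 30 test rows.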

supervised learning

See supervised model.

supervised model

A data mining model that is built using a known dependent variable, also referred to as the target. Classification and regression techniques are examples of supervised mining. See unsupervised model. Also referred to as predictive model.

support vector machine

An algorithm that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data. Support vector machines can make predictions with sparse data, that is, in domains that have a large number of predictor columns and relatively few rows, as is the case with bioinformatics data. Support vector machine can be used for classification, regression, and anomaly detection.

SVM

See support vector machine.

T

table

The basic unit of data storage in an Oracle database. Table data is stored in rows and columns.

target

In supervised learning, the identified attribute that is to be predicted. Sometimes called target value or target attribute. See also attribute.

test metrics

For classification, the accuracy, confusion matrix, lift, and receiver operating characteristics can be computed to assess the model. Similarly, for regression, R-squared and RMS errors can be computed.

text feature

A combination of words that captures important attributes of a document or class of documents. Text features are usually keywords, frequencies of words, or other document-derived features. A document typically contains a large number of words and a much smaller number of features.

text mining

Conventional data mining done using text features. Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data. Both ODM and Oracle Text support text mining.

Note: Not all algorithms support text mining. See Text Mining for more information.

top N frequency binning

This type of binning bins categorical attributes. The bin definition for each attribute is computed based on the occurrence frequency of values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin_1,..., bin_N corresponds to the values with top frequencies. The bin bin_N+1 corresponds to all remaining values.
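
A minimal sketch of this scheme follows, using bin numbers 1 through N for the most frequent values and N+1 for everything else. The function name and sample values are hypothetical, and ties in frequency are broken by first occurrence, which is an assumption of this example.

```python
from collections import Counter


def top_n_frequency_bin(values, n):
    """Map the n most frequent values to bins 1..n and all others to bin n+1."""
    top = [value for value, _ in Counter(values).most_common(n)]
    bin_of = {value: i + 1 for i, value in enumerate(top)}
    return [bin_of.get(v, n + 1) for v in values]


states = ["CA", "NY", "CA", "MA", "TX", "NY", "CA"]
bins = top_n_frequency_bin(states, 2)
```

Here CA (3 occurrences) goes to bin 1, NY (2 occurrences) to bin 2, and the remaining values MA and TX share bin 3.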

transactional format

Each case in the data is stored as multiple records in a table with columns sequenceID, attribute_name, and value. Also known as multi-record case. Compare with single-record case.
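
To make the contrast with single-record case concrete, the sketch below converts rows in single-record form into (sequenceID, attribute_name, value) records. The function name, the dictionary representation, and the choice to skip null values are assumptions for illustration.

```python
def to_transactional(rows, id_column):
    """Convert single-record cases to (sequenceID, attribute_name, value) records."""
    records = []
    for row in rows:
        for name, value in row.items():
            # Assumption: null values are simply not stored.
            if name != id_column and value is not None:
                records.append((row[id_column], name, value))
    return records


# Hypothetical single-record cases.
cases = [
    {"id": 1, "bread": 1, "milk": 1, "butter": None},
    {"id": 2, "bread": None, "milk": 1, "butter": 1},
]
records = to_transactional(cases, "id")
```

Each case now spans as many records as it has non-null attributes, which suits sparse data such as market baskets.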

transformation

A function applied to data resulting in a new representation of the data. For example, discretization and normalization are transformations on data.

transparency

Most mining models are built using transformed data; therefore, model build, model test, and model apply results are expressed in terms of the transformed data, not the original data. For example, you might get rules expressed in terms of bin numbers instead of actual values. If you use Oracle Data Miner build and apply activities, results are expressed in terms of the original data, not the transformed data.

trimming

A technique used for dealing with outliers. Trimming removes values in the tails of a distribution in the sense that trimmed values are ignored in further computations. This is achieved by setting the tails to NULL.
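
A minimal sketch of trimming follows, with Python's None standing in for NULL. The function name, the integer tail percentage, and the rank-based choice of cutoffs are assumptions; other percentile conventions would give slightly different cutoffs.

```python
def trim(values, tail_percent):
    """Replace values in the lower and upper tails with None."""
    ordered = sorted(values)
    k = len(ordered) * tail_percent // 100  # number of values in each tail
    if k == 0:
        return list(values)
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [v if low <= v <= high else None for v in values]


data = list(range(1, 20)) + [100]  # 19 ordinary values and one outlier
trimmed = trim(data, 5)
```

With 20 values and 5% tails, one value is removed from each end: the smallest value 1 and the outlier 100 both become None.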

U

unstructured data

Images, audio, video, geospatial mapping data, and documents or text data are collectively known as unstructured data. ODM supports the mining of unstructured text data.

unsupervised learning

See unsupervised model.

unsupervised model

A data mining model built without the guidance (supervision) of a known, correct result. In supervised learning, this correct result is provided in the target attribute. Unsupervised learning has no such target attribute. Clustering and association are examples of unsupervised mining functions. See supervised model.

V

view

A view takes the output of a query and treats it as a table. Therefore, a view can be thought of as a stored query or a virtual table. You can use views in most places where a table can be used.

W

winsorizing

A way of dealing with outliers. Winsorizing involves setting the tail values of a particular attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% of values are set equal to the minimum value in the 6th percentile, while the upper 5% are set equal to the maximum value in the 95th percentile.
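
The sketch below illustrates the idea by clamping each tail to the nearest value that is not in the tail. The function name, the integer tail percentage, and the rank-based cutoffs are assumptions; it does not reproduce any particular percentile convention.

```python
def winsorize(values, tail_percent):
    """Clamp values in the lower and upper tails to the nearest non-tail value."""
    ordered = sorted(values)
    k = len(ordered) * tail_percent // 100  # number of values in each tail
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [min(max(v, low), high) for v in values]


data = list(range(1, 20)) + [100]  # 19 ordinary values and one outlier
winsorized = winsorize(data, 5)
```

With 20 values and 5% tails, the smallest value 1 is raised to 2 and the outlier 100 is lowered to 19. In contrast to trimming, no values are discarded.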

Z

z-score normalization

Normalize numerical attributes using the mean and standard deviation computed from the data. Normalize each attribute using the transformation x_new = (x-mean)/standard deviation.
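
A minimal sketch of this transformation is shown below. The glossary does not say whether the population or the sample standard deviation is used; the population form (statistics.pstdev) is assumed here, and the function name is hypothetical.

```python
from statistics import fmean, pstdev


def z_score_normalize(values):
    """Return (x - mean) / standard deviation for each value."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Constant attribute: assumed to map to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


z_scores = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
```

The sample attribute has mean 5 and standard deviation 2, so the value 9 maps to a z-score of 2.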