Does My Model Reflect A Causal Relationship?
One powerful feature implemented in KnowledgeMiner (yX) for Excel
is the external evaluation of self-organized linear and nonlinear analytic
models. This document shows how this model evaluation approach helps
answer the question above. It also introduces a new model quality measure
that takes into account both the noise filtering power of the modeling
algorithm and model complexity: Descriptive Power.
The Problem
A key problem in knowledge mining from data is the final evaluation of developed models.
This evaluation is an important precondition for deploying models obtained by
Inductive Learning. By learning from a finite set of data alone, it is hardly possible to
decide whether the estimated model adequately reflects the causal relationship between
input and output, or whether it is just a stochastic model with non-causal correlations.
In addition to a properly working noise filtering procedure that avoids overfitting the
learning data, model evaluation needs some new, external information to justify a model's
quality, i.e., both its predictive and descriptive power.
Why
Let's have a look at this example: based on an artificial data set of 2 outputs,
4 inputs, and 15 samples, KnowledgeMiner (yX) self-organizes an analytical model for
each output variable, Y1 and Y2 (fig. 1).
a) Model 1: Y1=f1(x)
b) Model 2: Y2=f2(x)
Figure 1. Model (red) vs. actual (blue, overlaid) graph of the two models.
For model 1, a model quality Q of 0.9998 (with 1.0 as the best possible and zero
as the worst model quality) is reported, while model 2 shows a model quality of 0.9997.
Judging from these model qualities and from the graphs in fig. 1, there is no obvious reason
not to consider both models "true" models that reflect a causal relation between input
and output. KnowledgeMiner (yX), unlike the vast majority of data mining tools, already
implements a powerful noise filtering procedure in its inductive, self-organizing model
synthesis (see also the "Self-Organising Data Mining" book, section 3.2), which seems to
support this assumption.
However, the person who created the data set for this example states that only
one model actually describes a causal relationship, while the other model simply reflects
some stochastic correlations, because its output and inputs are completely independent of
one another (random numbers). Even with this information given, which is usually not the
case for real-world knowledge and data mining problems, the modeler cannot decide from the
available information which of the two models is the true model. Only applying the models
to some new data (prediction), which adds new information, reveals the true model (fig. 2):
a) Model 1: invalid
b) Model 2: valid
Figure 2. Prediction of samples 16 to 20 by the two models for Y1 and Y2.
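The effect shown in fig. 2 can be reproduced with a small Python sketch. It is not
KnowledgeMiner's algorithm: as assumptions made only for illustration, an ordinary
least-squares fit with 12 random inputs stands in for a flexible model, and R² stands in
for the model quality Q. Outputs and inputs are independent random numbers, so there is
no causal relationship to find, yet the fit on 15 learning samples tends to look good
while the prediction of 5 new samples does not.

```python
import random

def fit_linear(X, y):
    """Least-squares fit with intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(c + 1, p):
            f = A[k][c] / A[c][c]
            for j in range(c, p + 1):
                A[k][j] -= f * A[c][j]
    coef = [0.0] * p
    for i in reversed(range(p)):
        s = A[i][p] - sum(A[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = s / A[i][i]
    return coef

def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]

def r_squared(y, y_hat):
    """Assumed quality measure: 1 is a perfect fit, values <= 0 are useless."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def one_trial(rng, n_train=15, n_new=5, n_inputs=12):
    n = n_train + n_new
    X = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # independent of X by construction
    coef = fit_linear(X[:n_train], y[:n_train])
    fit_q = r_squared(y[:n_train], predict(coef, X[:n_train]))
    new_q = r_squared(y[n_train:], predict(coef, X[n_train:]))
    return fit_q, new_q

rng = random.Random(0)
results = [one_trial(rng) for _ in range(50)]
mean_fit = sum(f for f, _ in results) / len(results)
mean_new = sum(n for _, n in results) / len(results)
print(f"mean quality on learning data: {mean_fit:.2f}")
print(f"mean quality on new samples:   {mean_new:.2f}")
```

Averaging over 50 random data sets keeps the comparison from depending on a single lucky
or unlucky draw.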
This example clearly shows that a "closeness-of-fit" measure alone is not sufficient
to evaluate a model's predictive and descriptive power. Recent research has shown
that model evaluation requires at least a two-stage validation approach:
Level 1
Noise filtering that avoids overfitting the learning data, based on external information
(hypothesis testing) not used for creating a model candidate (hypothesis), as an
integrated part of the "Model Learning" process. The corresponding tool, which
KnowledgeMiner (yX) has used within "Model Learning" from the beginning, is
leave-one-out cross-validation.
Level 2
A characteristic that describes the noise filtering behavior of the "Model Learning"
process, used to justify model quality based on external information not yet used in the
first validation level. This noise filtering characteristic is implemented in KnowledgeMiner
for the first time for linear and nonlinear analytical models. It was obtained
by running Monte Carlo simulations many times. In this way, new and independent external
knowledge is available against which any model has to be adjusted.
Figure 3 shows a detail of the characteristic for linear analytical models.
Figure 3. Noise filtering characteristic
M: number of inputs; N: number of samples; Qu: virtual quality of a model
Qu=1: noise filtering does not work at all; Qu=0: ideal filtering
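As a point of reference for the first validation level, leave-one-out cross-validation can
be sketched as follows. This is a generic illustration, not KnowledgeMiner's implementation:
the single-input straight-line model, the squared-error criterion, and all function names
are assumptions. Each sample is predicted by a model fitted without it, so the error is
measured on information that was not used for estimating the candidate.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_squared_error(xs, ys):
    """Mean squared error on the same data the line was fitted to."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loo_squared_error(xs, ys):
    """Mean squared prediction error over all leave-one-out splits."""
    total = 0.0
    for i in range(len(xs)):
        # Fit without sample i, then predict sample i.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)

rng = random.Random(1)
n = 15
x_relevant = [rng.gauss(0, 1) for _ in range(n)]
x_irrelevant = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * x + rng.gauss(0, 0.1) for x in x_relevant]  # depends on x_relevant only

loo_relevant = loo_squared_error(x_relevant, y)
loo_irrelevant = loo_squared_error(x_irrelevant, y)
print(f"LOO error, relevant input:   {loo_relevant:.4f}")
print(f"LOO error, irrelevant input: {loo_irrelevant:.4f}")
```

The leave-one-out error is never smaller than the error on the fitted data, which is why it
is the more cautious criterion for comparing model candidates.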
The reasons for a second validation level are (1) that the noise filtering implemented
in level 1 is very likely not an ideal noise filter and thus does not work properly in
every case (see the example above), and (2) to obtain a new model quality measure that is
adjusted by the noise filtering power of the algorithm.
The noise filtering characteristic expresses a virtual model quality Qu that
can be obtained when using a data set of M potential inputs and N random
samples. It is a virtual model quality because, by definition, there is no
causal relationship between stochastic variables (true model quality Q = 0), yet
models built on random samples actually reach a quality Q > 0 (see example above) that
just reflects stochastic correlations. As a result, given any number of potential
inputs M and number of samples N, KnowledgeMiner can calculate a threshold quality
Qu = f(N, M) that any model's quality Q must exceed for the model to be considered
valid with respect to describing a relevant relationship between input and output.
Otherwise, a model of quality Q <= Qu is assumed invalid, since its quality Q can also be
obtained when simply using random variables, which means that this model's
quality does not differ significantly from that of a chance model. It has to be
considered garbage.
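The idea behind such a threshold can be sketched with a small Monte Carlo simulation.
The sketch below is an assumption-laden illustration, not the characteristic shipped with
KnowledgeMiner: it uses an unfiltered least-squares fit on all M inputs, R² as the quality
measure, and the mean over trials as the threshold. The names `chance_quality` and
`is_valid` are hypothetical.

```python
import random

def in_sample_r_squared(X, y):
    """R^2 of a least-squares fit with intercept, evaluated on the fitted data."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations as an augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(c + 1, p):
            f = A[k][c] / A[c][c]
            for j in range(c, p + 1):
                A[k][j] -= f * A[c][j]
    coef = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        s = A[i][p] - sum(A[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = s / A[i][i]
    y_hat = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def chance_quality(n_samples, n_inputs, trials=300, seed=0):
    """Mean quality reached on pure noise: a stand-in for Qu = f(N, M)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        X = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_samples)]
        y = [rng.gauss(0, 1) for _ in range(n_samples)]  # independent of X
        total += in_sample_r_squared(X, y)
    return total / trials

def is_valid(q, n_samples, n_inputs):
    """A model counts as valid only if its quality exceeds the chance level."""
    return q > chance_quality(n_samples, n_inputs)

for n, m in [(15, 4), (15, 8), (30, 4)]:
    print(f"N={n:2d}, M={m}: chance quality ~ {chance_quality(n, m):.2f}")
```

In this simplified setting the chance level rises with the number of inputs and falls with
the number of samples, which is the qualitative behavior the threshold is meant to capture.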
In addition to deciding whether a model appears valid or not, the noise filtering
characteristic is also a tool for quantifying to what extent the data is described
by a relevant relationship between input and output. This introduces a new model quality
measure that is adjusted for noise filtering and model complexity: Descriptive Power (DP).
It is defined in terms of Q, the measured quality of the evaluated model, and Qu(N, L),
the reference quality calculated from the number of samples N the model was created on and
from the number of input variables L the model is actually composed of
(selected relevant inputs), with L <= M. This means that Descriptive Power is a
chance-correlation-adjusted quality measure that is independent of the dimension of the
data set used to develop the model. For example, two models M1 and M2
show the same quality Q = Q1 = Q2, but M1 uses more
inputs than M2 to get that quality Q. So with L1 > L2,
the Descriptive Power of M2 is higher than that of M1.
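The comparison of M1 and M2 can be made concrete with a small sketch. Both ingredients
are assumptions chosen for illustration and are not taken from KnowledgeMiner:
`reference_quality` uses the expected R² of an ordinary least-squares fit on pure Gaussian
noise, L / (N - 1), as a stand-in for Qu(N, L), and `adjusted_quality` is a hypothetical
rescaling of the margin Q - Qu, not the Descriptive Power definition itself.

```python
def reference_quality(n_samples, n_selected):
    """Assumed stand-in for Qu(N, L): expected R^2 of least squares on noise."""
    return min(1.0, n_selected / (n_samples - 1))

def adjusted_quality(q, n_samples, n_selected):
    """Hypothetical chance-adjusted quality (illustration only)."""
    qu = reference_quality(n_samples, n_selected)
    if q <= qu:
        return 0.0  # no better than a chance model
    return (q - qu) / (1.0 - qu)

q = 0.90  # same measured quality for both models
dp_m1 = adjusted_quality(q, n_samples=15, n_selected=4)  # M1 uses L1 = 4 inputs
dp_m2 = adjusted_quality(q, n_samples=15, n_selected=2)  # M2 uses L2 = 2 inputs
print(f"M1: {dp_m1:.3f}  M2: {dp_m2:.3f}")
```

Under these assumptions, the model that reaches the same quality with fewer inputs keeps
the larger margin over its chance level.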
The bottom line
KnowledgeMiner (yX) for Excel evaluates a developed model by calculating its
Descriptive Power on the fly after modeling. You don't have to take care of it yourself.
KnowledgeMiner provides all information in the model report to make you more
effective and successful in your knowledge mining efforts.
Back to our example above: KnowledgeMiner shows this evaluation information in the
report for the two models after modeling (fig. 4):
a) Report of Model 1 --> status: invalid
b) Report of Model 2 --> status: valid
Figure 4. Reported evaluation results of the two models
This means the modeler knows instantly that model 2 indeed does well, with a
Descriptive Power of 42%, while model 1 is rated invalid with a certainty of 33%. Following
the recommendation given in the report of model 1 and increasing the number of samples to 21,
a second modeling run of KnowledgeMiner (yX) produces this report (fig. 5):
Figure 5. Evaluation result of model 1 after remodeling --> status: invalid with increased likelihood of chance model.
KnowledgeMiner now reports an increased certainty of 67% that this model is
just a chance model and therefore has to be rejected. It is also interesting to note
that this tiny modeling problem has been identified as a high-dimensional modeling task,
which sounds strange at first. However, "high-dimensional" has to be seen not only in
absolute but also in relative terms: with respect to model validation and reliability,
every modeling problem with a high ratio of number of inputs to number of samples is
actually a high-dimensional modeling task and has to be handled as such.
In summary, the two-stage model validation approach implemented in
KnowledgeMiner (yX) for Excel provides, for the first time, active decision
support in model evaluation. It minimizes the risk of misinterpreting a model's
quality and power and of using invalid models for prediction and classification tasks
that in fact just reflect a chance correlation.
©2011, Frank Lemke