DataModeler FAQ

The DataModeler FAQ is divided into three sections: Conceptual Questions, Practical Questions and Illustrations. The first two provide textual responses to common questions whereas the last consists of very short examples illustrating key points.

A caveat is that the illustrations devote very little time to model development, ranging from three seconds to two minutes. Although sufficient for illustration purposes, run times this short should not be construed as typical, since devoting additional CPU cycles is generally beneficial in developing quality models.

In addition to the HTML sections presented below (which will open in separate windows when clicked), there are PDF and Mathematica notebook versions available as part of the selected help extracts.

Conceptual Questions

The conceptual foundations used in the modeling.

Practical Questions

Practical issues related to the mechanics of model development.

Hypothesis Generation

Conventional methods impose artificial constraints on the model form (e.g., polynomials) even though these constraints have no physical basis and are imposed only to make the mathematics tractable for that method, or out of a lack of imagination. SymbolicRegression, in contrast, lets the data define the model form free of artificial limits. Part of this is the ability to hypothesize and explore diverse potential model structures.

Rapid Modeling

SymbolicRegression can identify driving variables and return quality models, in some cases in just a few seconds.

Fat Data Sets

Due to its ability to identify and focus on driving variables, SymbolicRegression can build models for data sets that have more variables than records.

Redundant Data

Real-world data sets often contain redundant data records, i.e., nearly the same information repeated many times. For any data modeling technique, redundant data slows the model development and may also degrade model quality. DataModeler offers tools for ranking data records based upon their incremental information content. This insight may be used to accelerate the model development as well as produce higher-quality models.
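To make the idea of ranking records by incremental information concrete, here is a minimal Python sketch. It is not DataModeler's algorithm or API. It uses a simple stand-in, greedy farthest-point ordering, in which each next record is the one farthest from everything already selected. The function name rank_by_novelty and the sample records are hypothetical.

```python
import math

def rank_by_novelty(records):
    """Order record indices so each next pick is farthest from those already chosen.

    Stand-in heuristic only: distance in input space is used as a crude
    proxy for "incremental information content".
    """
    remaining = list(range(len(records)))
    order = [remaining.pop(0)]  # arbitrary seed: the first record
    # Distance from each remaining record to its nearest selected record.
    nearest = {i: math.dist(records[i], records[order[0]]) for i in remaining}
    while remaining:
        nxt = max(remaining, key=lambda i: nearest[i])
        remaining.remove(nxt)
        order.append(nxt)
        for i in remaining:
            nearest[i] = min(nearest[i], math.dist(records[i], records[nxt]))
    return order

# Hypothetical data: three near-duplicates around (0, 0), two around (5, 5),
# and one lone record at (5, 0).
records = [(0.0, 0.0), (0.01, 0.0), (5.0, 5.0), (0.0, 0.02), (5.0, 0.0), (5.01, 5.0)]
order = rank_by_novelty(records)
print(order[:3])  # prints [0, 5, 4]: one record from each distinct region
```

Near-duplicate records end up at the tail of the ordering, so truncating the list is one plausible way to trim redundant data before modeling.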

Big Data Sets

DataModeler's SymbolicRegression algorithms are state-of-the-art and remarkably efficient. However, for really large data sets, there are strategies which can be useful beyond simply allocating more CPU effort. The key is to recognize that for large data sets, generally, not all the data has the same information content and to exploit that fact.

Correlated Data

Although common, correlated variables are a major problem for most modeling techniques. In contrast, SymbolicRegression can identify the best of correlated inputs and may synthesize metavariables to produce insightful, robust and quality models.

Noisy Data

Modeling noisy data is difficult since we want to model the fundamental behavior and not noise-induced perturbations. DataModeler and SymbolicRegression let us easily identify the models providing the best trade-off between model complexity, accuracy and constituent variables.
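The complexity-versus-accuracy trade-off can be pictured as a Pareto front: the set of models for which no other model is both simpler and more accurate. The Python sketch below illustrates that selection for two of the three criteria (complexity and error) and ignores the choice of variables. The Candidate type, the labels and the numbers are invented for illustration and are not DataModeler output.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str        # hypothetical label
    complexity: int  # e.g., node count of the expression
    error: float     # e.g., 1 - R^2 on the data

def pareto_front(models):
    """Return the models not dominated in (complexity, error); lower is better."""
    front = []
    for m in models:
        dominated = any(
            o.complexity <= m.complexity
            and o.error <= m.error
            and (o.complexity < m.complexity or o.error < m.error)
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m.complexity)

candidates = [
    Candidate("a", 3, 0.40),
    Candidate("b", 7, 0.12),
    Candidate("c", 9, 0.15),   # dominated by b: more complex and less accurate
    Candidate("d", 15, 0.10),
    Candidate("e", 40, 0.10),  # dominated by d: same error, more complex
]
front = pareto_front(candidates)
print([m.name for m in front])  # prints ['a', 'b', 'd']
```

With noisy data, the large jump in complexity from "b" to "d" for a small gain in accuracy is the kind of signal that the extra terms may be fitting noise rather than behavior.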

Trustable Models

Empirical models only know the information they have been provided during their development. As a result, using them is a bit like driving a car only using the rear-view mirror. Ensembles of diverse but accurate models let us take some of the trepidation out of model use by providing a warning that either the system dynamics have changed or the model is being asked to operate in uncharted territory. Trustable Models is a unique and valuable benefit of SymbolicRegression which is possible because we can develop diverse model structures which are comparable in both accuracy and complexity.
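As a rough illustration of how ensemble disagreement can serve as a warning, the Python sketch below uses three hypothetical models with different structures. They agree closely on an assumed training range of 0 to 2 and drift apart outside it. The models, the range and the function name predict_with_trust are assumptions made for this example, not DataModeler's trust metric.

```python
import statistics

# Three hypothetical models of different structure. They nearly coincide
# on the assumed training range 0 <= x <= 2 and diverge elsewhere.
ensemble = [
    lambda x: 1.0 + 0.5 * x,
    lambda x: 1.0 + 0.5 * x + 0.01 * x * (x - 1.0) * (x - 2.0),
    lambda x: 1.0 + x / (2.0 + 0.005 * x * (x - 2.0)),
]

def predict_with_trust(x):
    """Return (median prediction, spread); a large spread is the warning sign."""
    preds = [model(x) for model in ensemble]
    return statistics.median(preds), max(preds) - min(preds)

inside_pred, inside_spread = predict_with_trust(1.0)     # within the training range
outside_pred, outside_spread = predict_with_trust(10.0)  # far outside it

print(f"x=1:  prediction {inside_pred:.3f}, spread {inside_spread:.4f}")
print(f"x=10: prediction {outside_pred:.3f}, spread {outside_spread:.4f}")
```

The spread stays small where the models were constrained by data and grows where they are extrapolating, which is the behavior a single model cannot reveal about itself.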

Outlier Detection

Outlier detection for nonlinear systems with lots of input variables is very hard to achieve using conventional methods. However, an outlier is either the most important nugget in the data set or something which should be removed from the modeling process to avoid distorting the results. Deciding which requires human insight. DataModeler provides tools for outlier detection both before and after the model development. Here we look at model-based outlier detection.
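One common way to do model-based outlier detection is to flag records whose residual against a fitted model is unusually large. The Python sketch below does this with a robust score based on the median absolute deviation (MAD). It is a generic technique chosen for illustration, not a description of DataModeler's tools. The model, the data and the 3.5 threshold are assumptions.

```python
import statistics

def flag_outliers(xs, ys, model, threshold=3.5):
    """Return indices of records whose residual has a robust score above threshold."""
    residuals = [y - model(x) for x, y in zip(xs, ys)]
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    if mad == 0:
        return []  # residuals are too uniform to score
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the residuals are roughly normal.
    scores = [0.6745 * (r - med) / mad for r in residuals]
    return [i for i, s in enumerate(scores) if abs(s) > threshold]

def model(x):
    """Hypothetical fitted model."""
    return 2.0 * x + 1.0

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 4.0, 0.15, -0.05]  # record 5 is suspicious
ys = [model(x) + n for x, n in zip(xs, noise)]

flagged = flag_outliers(xs, ys, model)
print(flagged)  # prints [5]
```

The code only nominates candidates. Whether a flagged record is a nugget or an error remains a human decision.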

Active DOE

The conventional approach to experimental design is to: 

(a) make a bunch of simplifying assumptions

(b) assume a model form based upon ease of data analysis

(c) run a batch of experiments and

(d) check whether the data confirmed the a priori assumptions and, if not, start over.

The ability of SymbolicRegression to synthesize, assess and refine models and, furthermore, build a trust metric on WHERE those models are valid means that we can integrate modeling and data collection and shift from a passive collect-data-then-analyze mode into an active DOE framework. The benefits are HUGE. At the end of the data collection we have BOTH an awareness of significant variables AND a quality response model. Furthermore, at each step of the process, we have chosen the next data point to maximize the anticipated information content, thereby achieving a better result with fewer experiments than if a conventional approach were adopted. For some systems, an active DOE strategy could require orders of magnitude fewer experiments. The implications for time-to-market, product quality and customer satisfaction should be obvious.
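The selection step of such a loop can be sketched in Python as follows. The sketch assumes that "anticipated information content" is approximated by how much an ensemble of models disagrees at a candidate point, which is one plausible reading rather than DataModeler's documented method. The ensemble, the candidate list and the function name next_experiment are hypothetical.

```python
def next_experiment(candidates, ensemble):
    """Pick the candidate input where the ensemble members disagree most."""
    def spread(x):
        preds = [model(x) for model in ensemble]
        return max(preds) - min(preds)
    return max(candidates, key=spread)

# Hypothetical ensemble: all three models fit the points already measured
# (x = 0, 1, 2) exactly and differ only in between and beyond them.
ensemble = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.3 * x * (x - 1.0) * (x - 2.0),
    lambda x: 2.0 * x - 0.1 * x * (x - 1.0) * (x - 2.0),
]
candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]

print(next_experiment(candidates, ensemble))  # prints 3.0
```

In a full loop, the chosen experiment would be run, the new record added to the data, the models rebuilt, and the selection repeated until the ensemble agrees everywhere of interest.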