Illustrations
On these pages we illustrate key aspects of DataModeler and SymbolicRegression. All illustrations are available in the tutorial DataModeler FAQ and Examples. If you have Mathematica installed, you can download the tutorial from here and view it with interactive graphics. The links below lead to HTML documents, which can be viewed in any browser but lack the metadata in the graphics.

Illustration 1: Creative Hypothesis Generation — Conventional methods impose artificial constraints on the model form, e.g., polynomials, even though these constraints have no physical basis and are imposed only to make the mathematics tractable for that method or because of a lack of imagination. SymbolicRegression, in contrast, lets the data define the model form free of artificial limits. Part of this is the ability to hypothesize and explore diverse potential model structures.

Illustration 2: Rapid Variable Selection and Modeling — SymbolicRegression can identify driving variables and return quality models — in just a few seconds in some cases.

Illustration 3: Variable Selection and Modeling in Underdetermined (fat) Data Arrays — Because it can identify and focus on the driving variables, SymbolicRegression can build models on data sets that have more variables than records.

Illustration 4: Redundant Information in the Data — Real-world data sets often contain redundant data records, i.e., nearly the same information repeated many times. For any data modeling technique, redundant data slows model development and may also degrade model quality. DataModeler offers tools for ranking data records based upon their incremental information content. This insight may be used to accelerate model development as well as to produce higher-quality models.
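
The idea of ranking records by incremental information content can be sketched in a few lines of Python. This is not DataModeler's algorithm; it is a generic greedy farthest-point ordering, and it assumes that "information" can be approximated by the distance from a record to the nearest record already ranked. The function name rank_by_novelty and the sample records are hypothetical.

```python
import math

def rank_by_novelty(records):
    """Order record indices so that each one adds the most new information.

    Assumption: novelty is the Euclidean distance to the nearest record
    that has already been ranked (greedy farthest-point ordering).
    """
    remaining = list(range(len(records)))
    ranked = [remaining.pop(0)]  # seed the ranking with the first record
    while remaining:
        best = max(
            remaining,
            key=lambda i: min(math.dist(records[i], records[j]) for j in ranked),
        )
        remaining.remove(best)
        ranked.append(best)
    return ranked

# Hypothetical data: records 1 and 3 nearly duplicate record 0.
records = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (0.0, 0.02), (1.0, 0.0)]
order = rank_by_novelty(records)
```

With this ordering the near-duplicate records fall to the end of the ranking, so a model could be developed on a leading subset first and the remainder held back for validation.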

Illustration 5: Working with Big Data Sets — DataModeler's SymbolicRegression algorithms are state-of-the-art and remarkably efficient. However, for really large data sets, there are strategies which can be useful beyond simply allocating more CPU effort. The key is to recognize that, in large data sets, not all of the data generally has the same information content, and to exploit that fact.

Illustration 6: Handling Correlated Variables — Although common, correlated variables are a major problem for most modeling techniques. In contrast, SymbolicRegression can identify the best of correlated inputs and may synthesize metavariables to produce insightful, robust and quality models.

Illustration 7: Dealing with Noisy Data — Modeling noisy data is difficult since we want to model the fundamental behavior and not noise-induced perturbations. DataModeler and SymbolicRegression let us easily identify the models providing the best trade-off between model complexity, accuracy and constituent variables.

Illustration 8: Trustable Regression Models — Empirical models only know the information they were given during their development. As a result, using them is a bit like driving a car using only the rearview mirror. Ensembles of diverse but accurate models let us take some of the trepidation out of model use by providing a warning that either the system dynamics have changed or the model is being asked to operate in uncharted territory. Trustable models are a unique and valuable benefit of SymbolicRegression, made possible because we can develop diverse model structures which are comparable in both accuracy and complexity.
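
A minimal Python sketch of the ensemble idea, under stated assumptions: the three model functions below are hypothetical stand-ins for structurally different models that all approximate y = x**2 on a training range of 0 to 1, and the spread of their predictions is used as the warning signal. It does not use or reproduce the DataModeler API.

```python
from statistics import mean, pstdev

# Hypothetical ensemble: three different model structures that agree
# closely on the assumed training range 0 <= x <= 1.
def model_a(x):
    return x * x

def model_b(x):
    return x * x + 0.1 * x * (x - 1.0) * (x - 0.5)

def model_c(x):
    return x * x - 0.1 * x * (x - 1.0) * (x - 0.5)

ENSEMBLE = [model_a, model_b, model_c]

def predict_with_trust(x):
    """Return (mean prediction, spread); a large spread signals low trust."""
    predictions = [m(x) for m in ENSEMBLE]
    return mean(predictions), pstdev(predictions)

inside_mean, inside_spread = predict_with_trust(0.5)    # inside the training range
outside_mean, outside_spread = predict_with_trust(3.0)  # extrapolation
```

Inside the training range the members agree and the spread stays small; at x = 3.0 they diverge, which is the kind of signal that would flag the prediction as extrapolation.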

Illustration 9: Model-based Outlier Detection — Outlier detection for nonlinear systems with many input variables is very hard to achieve using conventional methods. However, an outlier is either the most important nugget in the data set or something which should be removed from the modeling process to avoid distorting the results. Deciding which requires human insight. DataModeler provides tools for outlier detection both before and after model development. In this illustration we look at model-based outlier detection.

Illustration 10: Active Design of Experiments — The conventional approach to experimental design is to: (1) make a list of simplifying assumptions, (2) assume a model form based upon ease of data analysis, (3) run a batch of experiments, (4) check whether the data confirmed the a priori assumptions and, if not, start over. The ability of SymbolicRegression to synthesize, assess and refine models and, furthermore, to build a trust metric on where those models are valid means that we can integrate modeling and data collection and shift from a passive collect-data-then-analyse mode into an active learning framework. The benefits are huge. At the end of the data collection we have both an awareness of the significant variables and a quality response model. Furthermore, at each step of the process, we have chosen the next data point to maximize the anticipated information content, thereby achieving a better result with fewer experiments than if a conventional approach were adopted. For some systems, an active Design of Experiments (DoE) strategy could require orders-of-magnitude fewer experiments. The implications for time-to-market, product quality and customer satisfaction should be obvious.
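
The active loop described above can be sketched in Python as a simple query-by-committee scheme. Everything here is an assumption made for illustration: run_experiment stands in for a real measurement, the committee is three fixed one-parameter model forms fitted by least squares (not symbolic regression), and the next experiment is the candidate input where the fitted models disagree most.

```python
import math
from statistics import pstdev

def run_experiment(x):
    """Stand-in for a real measurement; the true response is hypothetical."""
    return 2.0 * math.sqrt(x)

# Candidate one-parameter model forms y = a * f(x); purely illustrative.
BASIS_FUNCTIONS = [lambda x: x, lambda x: x * x, math.sqrt]

def fit_committee(data):
    """Least-squares fit of the coefficient a for each model form."""
    models = []
    for f in BASIS_FUNCTIONS:
        a = sum(f(x) * y for x, y in data) / sum(f(x) ** 2 for x, _ in data)
        models.append(lambda x, a=a, f=f: a * f(x))
    return models

def next_experiment(models, candidates):
    """Pick the candidate input where the committee disagrees the most."""
    return max(candidates, key=lambda x: pstdev([m(x) for m in models]))

data = [(1.0, run_experiment(1.0))]  # a single initial measurement
candidates = [0.5, 1.0, 2.0, 4.0]
chosen = []
for _ in range(2):
    models = fit_committee(data)
    x_next = next_experiment(models, candidates)
    chosen.append(x_next)
    candidates.remove(x_next)
    data.append((x_next, run_experiment(x_next)))
```

Each pass refits the committee on all data collected so far, so the choice of the next point reflects what has already been learned rather than a design fixed in advance.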