Release news and events

DataModeler Release 8.16 (1 March 2013)

Friday, March 1, 2013

We have a new (207 MB) DataModeler release available for your retrieval.

This release has three main themes: two new functions, the closing of an evolutionary loophole, and a guard against a potential Mathematica pathology. The new functions are DataCompletenessMap and DataCompletenessPlot, which let us easily look at the prevalence and distribution of nonnumerics in the data we want to model. Of course, we layer the information presented with intelligent and easily specified tooltips and make it easy to generate quality, insightful graphics.

The evolutionary loophole which has been closed is in the handling of nonnumerics. Previously, we had a NumericColumnThreshold which restricted model development to those inputs which contained at least a specified fraction of numeric entries (and only the numeric predictions were considered in evaluating model quality). However, if we allowed more than the default fraction of missing elements, the model search algorithm could combine multiple inputs whose missing entries fell in different records, and the net result was that models were evaluated on an even smaller fraction of the data records. To address this, we introduced a NumericPredictionRequirement which requires models to evaluate to a numeric for a minimum fraction of the data records; otherwise, the model is rejected even though its inputs individually satisfy the completeness requirement.
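
To make the loophole concrete, here is a small sketch of the arithmetic. It is written in Python purely for illustration and is not DataModeler code; the helper names, the toy data and the 0.75 settings are assumptions chosen for the example. Each of the three inputs is 75% numeric, yet only 25% of the records are complete across all three.

```python
def is_numeric(value):
    """True for int/float entries that are not NaN; anything else counts as nonnumeric."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value == value  # NaN is the only float that is not equal to itself
    )


def numeric_fraction(values):
    """Fraction of the entries in a column that are numeric."""
    values = list(values)
    return sum(is_numeric(v) for v in values) / len(values)


def complete_record_fraction(records, columns):
    """Fraction of records in which every one of the given columns is numeric."""
    complete = sum(all(is_numeric(r[c]) for c in columns) for r in records)
    return complete / len(records)


# Hypothetical data: eight records, three inputs; None marks a missing entry.
MISSING = None
records = [
    (1.0, 2.0, MISSING),
    (1.5, 2.5, MISSING),
    (2.0, MISSING, 3.0),
    (2.5, MISSING, 3.5),
    (MISSING, 4.0, 4.0),
    (MISSING, 4.5, 4.5),
    (4.0, 5.0, 5.0),
    (4.5, 5.5, 5.5),
]

column_threshold = 0.75        # assumed per-input completeness requirement
prediction_requirement = 0.75  # assumed minimum fraction of records a model must evaluate on

per_column = [numeric_fraction(r[c] for r in records) for c in range(3)]
combined = complete_record_fraction(records, columns=(0, 1, 2))

every_input_passes = all(f >= column_threshold for f in per_column)
model_accepted = combined >= prediction_requirement

print(per_column, combined, every_input_passes, model_accepted)
```

A model using all three inputs would previously have been scored on just two of the eight records; with a prediction requirement of this kind it is rejected instead.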

The pathology issue is that Mathematica can consume all available memory, virtual memory and disk space during very difficult model searches, that is, very long modeling runs with transcendental functions. Although this behavior is stochastic and only arises for rare data sets and modeling problems, it is effectively a memory leak. We now monitor memory use and truncate the search if a MemoryLimit is reached, returning and archiving the results at that point just as we would if a TimeConstraint were encountered.
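
The general pattern of truncating a search when either a memory or a time budget is exhausted can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the DataModeler implementation: run_search, its parameters and the simulated leak are all hypothetical, and memory is read through a caller-supplied function so that the example stays deterministic.

```python
import time


def run_search(step, read_memory, memory_limit, time_limit, clock=time.monotonic):
    """Call step() repeatedly until it returns None or a budget is exceeded.

    step:         callable returning the next result, or None when the search is done
    read_memory:  callable returning the current memory use (any consistent unit)
    memory_limit: maximum allowed growth in memory relative to the starting value
    time_limit:   maximum wall-clock seconds
    Returns the results gathered so far together with the reason for stopping.
    """
    start_memory = read_memory()
    start_time = clock()
    results = []
    reason = "completed"
    while True:
        if read_memory() - start_memory > memory_limit:
            reason = "memory limit"
            break
        if clock() - start_time > time_limit:
            reason = "time limit"
            break
        result = step()
        if result is None:
            break
        results.append(result)
    return results, reason


# Simulated search in which every generation leaks 10 units of memory.
state = {"memory": 100, "generation": 0}


def leaky_step():
    state["generation"] += 1
    state["memory"] += 10
    return state["generation"]


results, reason = run_search(
    leaky_step, lambda: state["memory"], memory_limit=35, time_limit=60.0
)
print(results, reason)
```

The point of the design is that hitting the memory budget is treated exactly like hitting the time budget: the loop stops and whatever has been found so far is returned rather than lost.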

The complete release notes are below:

The highlights of this release are the new DataCompletenessMap and DataCompletenessPlot. These are useful to get the zen of data sets which feature incomplete data. Related to this, we have introduced a new SymbolicRegression option, NumericPredictionRequirement, since the default of evaluating model quality only on complete records in the data meant that models could, in extreme circumstances, be assessed on far fewer data records than expected even though each of the constituent variables passed the NumericColumnThreshold.

Mathematica also appears to have some memory leak issues which can come into play for very long modeling runs. To address this, we implemented a MemoryLimit on SymbolicRegression to avoid runaway consumption of RAM and disk space.

  • Introduced two new functions related to data completeness, DataCompletenessMap and DataCompletenessPlot, which provide a visual assessment of the presence of non-numeric elements in a data set.
  • Addressed a problem wherein models could be assessed on a smaller than expected fraction of the data set even though the individual variables satisfied the completeness threshold specified by the NumericColumnThreshold. Hence, we introduced a new SymbolicRegression option, NumericPredictionRequirement, which sets the minimum fraction of data records which must be numerically evaluatable by a model for the non-numerics to be automatically excluded from the assessment of ModelQuality. Along with this change, the NumericColumnThreshold default has been set to Automatic, which will use the NumericPredictionRequirement as its threshold.
  • Modified SmallPlot so that it stays unevaluated if the input format is not recognized. This allows it to be used as a pure function for options such as ToolTipFunction.
  • Implemented a default Mesh -> Automatic setting for CorrelationMatrixPlot which will suppress the mesh if a large number of data columns are being plotted. Also trapped the situation where nominally numeric columns didn't have any numeric overlap.
  • Modified SymbolicRegression so that StoreModelSet can be used within the GenerationMonitor, CascadeMonitor, RunMonitor or EvolutionMonitor pure functions.
  • Implemented a MemoryLimit option for SymbolicRegression to guard against the Mathematica kernel memory leaks during the model search. This places an upper bound on the incremental memory required during each of the IndependentEvolutions. Additionally, a MemoryMonitor option was created which can return the profile of memory consumption over the course of each kernel's model search.

DataModeler Release 8.13 (3 Dec 2012)

Monday, December 3, 2012

We are glad to announce that this release should be compatible with both Mathematica 8 and 9!

If you do encounter any issues, please send them our way. Thanks!

The official Release Notes for 8.13

  • The help has been rebuilt using Mathematica 9 so it is searchable using both versions.
  • Wolfram Research changed the ChartElementFunction setting names for BoxWhiskerChart with version 9 and about half of these are not functional yet. Hence, the default for VariablePresenceDistributionChart was changed to "BoxWhisker", which is a valid setting that works in both version 8 and version 9.
  • ParetoFrontPlot was misbehaving in Mma 9 (due to a change in how ListPlot handled the DataRange option) but is now displaying properly.
  • The PlotLegends package has been deprecated. Since CorrelationMatrixPlot was the only function which overtly used PlotLegends and Mathematica 9 has a nice implementation of PlotLegends as an option throughout most Mathematica graphics, we deleted the dependency upon the PlotLegends package.
  • DataOutliers for EnsembleResidualPlot were not being handled properly in some rare cases. This has been fixed.

DataModeler Release 8.12 (27 Nov 2012)

Tuesday, November 27, 2012

We are happy to release DataModeler 8.12. The key highlights are:

  • Archived models are now automatically compressed to reduce file sizes (factors of 25 are a good thing).
  • Implemented support for VariablesToPlot option in a variety of functions — this makes data and performance exploration much cleaner for high-dimensional data sets as we gain insight into the key inputs.
  • Implemented support for display of DataOutliers in a variety of functions. Associated with this support is changing some plotting defaults since the outliers will by default be denoted in red.
  • Greatly improved the behavior of BivariatePlot so that large multi-dimensional data sets can be safely handled without blowing out the memory footprint of the notebook.
  • Implemented a new function, ModelPredictionComparisonPlot, which is useful for looking at prediction performance trajectories relative to the observed behavior. The use of the SortBy, DataVariableReference and DataOutliers options makes this a pretty powerful function. Under the hood, it uses SmallPlot so it can efficiently handle large data sets.
  • Modified the MultiCore behavior of SymbolicRegression to allow finer-grain control of the number of cores operating in parallel. (The upcoming Mathematica 9 appears to offer more subkernel licenses so we can peg the crunching capabilities of our machines.)
  • Note for Mathematica 9 testers: Mathematica 9 stomps on a couple of DataModeler functions (KernelID and AbsoluteCorrelation) so we have renamed those functions in this release (they are still supported in Mma8). Mma9 also changes the documentation system so the current help is not discoverable; however, model development and existing notebooks should work.

Since it is pretty spiffy, let's quickly look at the ModelPredictionComparisonPlot. One basic use is to look at model performance against time series data. Here we look at data from a distillation column with DataOutliers highlighted.

ModelPredictionComparisonPlot

We can also use SortBy and DataVariableReference to reorder the data or define the x-axis in the plot. Note that in this case the frame labels are automatically adjusted to provide the audit trail information.

ModelPredictionComparisonPlot with DataVariableReference

The official release notes and changes for 8.12:

  • Implemented support in ResponsePlotExplorer for DataVariableLabels. These will now be used for the variable sliders. The default behavior will be to use ColorizeList to color code the ModelVariables used for the slider labels so that they match those used in the graphic labels.
  • Added a Compress option to StoreModelSet which determines whether the archived models should be processed using Compress to reduce file sizes. The default is to compress the files. The complementary RetrieveModelSet function will recognize the archival choice and Uncompress the file, if needed.
  • Fixed a bug in SymbolicRegression wherein MetaVariables were not being supported subsequent to the first of the IndependentEvolutions of each kernel or subkernel. This would manifest itself as a pathology if only a single variable was supplied to the modeling.
  • Modified the InversePatternMapping rules associated with SymbolicRegression. Although functionally similar to the previous performance, orders-of-magnitude speed gains were realized relative to the previous approach when ActiveGenomeSimplification was enabled with a SimplificationFunction setting of Expand or ExpandAll.
  • Modified OptimizeModel and OptimizeModelExpression to accept options for SelectModels if a list of models is supplied.
  • Modified GridTable to support options appropriate for Framed. Now the bounding box can be suppressed by setting FrameStyle to None and the appearance tweaked via other options such as Background, RoundingRadius, FrameMargins, etc.
  • Modified UnivariatePlot, BivariatePlot, DataDistributionPlot, CorrelationChart and CorrelationMatrixChart to support a VariablesToPlot option. This makes data exploration easier as the data modeling progresses and high-priority inputs are identified.
  • Implemented support in a variety of functions for the display and annotation of DataOutliers. These include UnivariatePlot, BivariatePlot, ModelPredictionPlot, EnsemblePredictionPlot, ModelResidualPlot and EnsembleResidualPlot. As part of this change, the plot style for many functions has been changed so that, by default, red is reserved for denoting outliers.
  • Extensive modifications to improve the scaling and functionality of BivariatePlot. Provided support for data subsampling within BivariatePlot so that the n^2 expansion in the graphics does not produce an inordinately large memory footprint if large data sets are supplied. Support for displaying DataOutliers and controlling the VariablesToPlot was also incorporated as well as allowing finer control of setting the various graphic styles.
  • Modified UnivariatePlots to support a DataVariableReference option which allows the x-axis for the plots to be specified rather than just looking at the data trajectory. This option is useful if the data records are not uniformly sampled.
  • Modified RangeLength to support a specified start value. Thus RangeLength[ list, 0 ] will produce zero-relative indexing rather than the default 1-relative behavior.
  • Modified SymbolicRegression to support TargetColumn option settings of one of the DataVariables or DataVariableLabels. Previously, this had to be specified as an index into one of the columns of the supplied data matrix. Last or First are now also valid settings with Last (i.e., the final data column) continuing to be the default.
  • Changed the default EnsembleDivergenceFunction to (3*StandardDeviation[#]&) from the previous setting of the model extremals. Since we target diverse models in assembling a ModelEnsemble and include “sloppy but good” models as a means to detect extrapolation and changes in the fundamentals of the targeted system, we want to flag the divergence of the models. Given the stochastic nature of the model selection, using the envelope of predictions implies too much confidence in the extremal models. Conversely, we want to incorporate them into the assessment so we do not want to use a robust statistic such as MedianDeviation as a foundation. The 95% confidence limit chosen based upon the (nonrobust) StandardDeviation seems like a reasonable compromise given the operational purpose of the EnsembleDivergenceFunction.
  • Modified the default SignificanceLevel for MetaVariableDistributionChart and MetaVariableDistributionTable to be { 10, 0.4 }. This form requires that a MetaVariable be present in at least 10 models of at least one of the IndependentEvolutions and that it be in at least 40% of the models of one of the IndependentEvolutions (not necessarily the same one). Setting the minimum threshold for model count avoids trivial results when only one model from an independent evolution might have passed the selection (e.g., QualityBox) criteria.
  • Implemented a new function, ModelPredictionComparisonPlot, which shows the prediction overlaid on the observed behavior. The SortBy option can be used to sequence the data records of the supplied data sets and DataVariableReference may be used to specify the x-axis.
  • Modified ConfidenceEllipsoid, ConfidenceEllipsoidSelection and ConfidenceEllipsoidSelectionIndices to allow duplicate and constant data columns to be supplied. The supplied data still has to be strictly numeric.
  • Modified the MultiCore option for SymbolicRegression. Mathematica 9 will introduce support for more subkernels so we will be able to tap into the multiple physical and virtual cores (available via hyperthreading). MultiCore may now be specified as None, Automatic, All or an integer ranging up to the $ProcessorCount for the machine. Each additional subkernel will reduce the CPU effort allocated to the individual IndependentEvolutions; however, the search diversity is generally a benefit. Testing indicates that the All setting will approximately halve the number of modeling generations for a given TimeConstraint relative to running a single kernel, and reduce it by about 25% relative to using half of the available kernels (i.e., the Automatic option setting); hence, it may be desirable to lengthen the TimeConstraint. We also implemented some recovery support for when subkernels spontaneously disconnect and get lost, but we still cannot recover the licenses associated with the lost kernels until Mathematica is restarted. The default setting is Automatic to allow for use of other applications; however, for serious (e.g., overnight) model search, a setting of All would probably be appropriate.
  • The soon-to-be-released Mathematica 9 introduces two new functions which stomp on DataModeler functions. AbsoluteCorrelation is about half the speed of the DataModeler implementation so we have renamed the current version to AbsCorrelation. Similarly, KernelID is an undocumented developer function so we have renamed the DataModeler implementation to KernelNumber. AbsoluteCorrelation and KernelID will continue to work underneath Mathematica 8 (and, possibly, under Mathematica 9).
  • Modified SubSample and SmallPlot to handle supplied lists of Tooltips. If the dataset size exceeds the DataSegments limit, the tooltips will be stripped. Otherwise, the tooltips will be restored after processing. Included in this is working around a bug in ListPlot wherein it does not handle the display of doublets (i.e., two DataOutliers in ModelPredictionComparisonPlot).
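
The { 10, 0.4 } SignificanceLevel rule described in the notes above has two parts that may be satisfied by different IndependentEvolutions. A minimal sketch of that logic, in Python for illustration only (the function name, argument layout and example counts are hypothetical and not part of DataModeler):

```python
def is_significant(counts_by_evolution, models_by_evolution, min_count=10, min_fraction=0.4):
    """Apply a {min_count, min_fraction} style rule to a single metavariable.

    counts_by_evolution: number of models containing the metavariable, per evolution
    models_by_evolution: total number of selected models, per evolution
    The count test and the fraction test may be passed by different evolutions.
    """
    count_ok = any(c >= min_count for c in counts_by_evolution)
    fraction_ok = any(
        n > 0 and c / n >= min_fraction
        for c, n in zip(counts_by_evolution, models_by_evolution)
    )
    return count_ok and fraction_ok


# Count met in the first evolution (12 models), fraction met in the second (3 of 6).
split_case = is_significant([12, 3], [100, 6])

# A single surviving model gives a 100% fraction but fails the count test.
trivial_case = is_significant([1], [1])

# Plenty of models, but the metavariable appears in only 12% of them.
diluted_case = is_significant([12], [100])

print(split_case, trivial_case, diluted_case)
```

The second case is the trivial result the count threshold is meant to suppress: one model passing the selection criteria should not make its metavariables look universally prevalent.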

DataModeler Release 8.08 (16 May 2012)

Wednesday, May 16, 2012

We are proud to release DataModeler 8.08 (16 May 2012)! Other than working around a "designed-as bug/feature" in Compile, the theme of this release is major new capabilities for metavariable identification and exploitation. A metavariable is simply a combination of variables or a transform of a variable which is useful in the developed models. There are eight new functions supporting this capability:

New Functions

We can exploit the diversity of model forms developed during SymbolicRegression by running many IndependentEvolutions and looking for those MetaVariables which are prevalent in quality models. This can provide insight into underlying mechanisms and alternative paths to quality models — which is especially useful when we have highly correlated/coupled inputs and multiple paths to producing models of comparable accuracy and conciseness.

MetaVariableDistributionChart

Of course, as illustrated below, we can also specify MetaVariables to be exploited and explored during SymbolicRegression. In this fashion, we can explore the potential of these metavariables as well as bias the model search towards their exploitation (since the supplied variable combinations or transforms do not have to be rediscovered). If we were so inclined, we could even exclude the direct use of any of the DataVariables and use only the MetaVariables in the model search.

Symbolic Regression Example

In summary, the support for the discovery and exploitation of MetaVariables is a major enhancement in DataModeler. In addition to the documentation and help examples, you might also like to check out the new case study, Symbolic Regression is Not Enough, which looks at these new capabilities within the context of a modeling workflow and also highlights some of the recently-added capabilities around variable combination analysis. (To get to the case studies, open up the DataModeler guide page in Mathematica's help and click on the tutorials link in the introductory paragraph.)

The official release notes for 8.08:

Support for the identification and exploitation of MetaVariables was the main theme of this release. However, we also discovered a "designed as" bug in the Mathematica Compile behavior that warrants a workaround.

  • The default behavior of Compile is to value speed rather than quality. Hence, for example, evaluating Compile[{x},UnitStep[1/x]][0] will return a value of 1 rather than detecting the pathology. When reported to WRI, the official response was that this was proper behavior. Since this is a dangerous behavior given the disparate model forms synthesized during SymbolicRegression, NumericCompile has been modified to use RuntimeOptions -> "Quality". More discussion is in the NumericCompile help.
  • Implemented support for MetaVariables. Now users can specify MetaVariables to SymbolicRegression and those will be used in the model development (the returned models, however, will be expressed in terms of the native DataVariables). Specifying these variable transforms and combinations can accelerate the model discovery as well as guide the structure of the models returned.
  • Added a new case study, Symbolic Regression is Not Enough, based upon our chapter for the 2012 Genetic Programming Theory & Practice Workshop. This paper looks at the issues around the modeling process and highlights the need for context and tools to identify and select key variables and metavariables in the pursuit of deployable models.
  • Implemented a suite of functions to identify, prioritize and extract MetaVariables from developed models. MetaVariables, MetaVariablePresence and MetaVariableTable look at the aggregated model set for metavariables.
  • Also implemented were functions looking at the variability of metavariable discovery. These functions, MetaVariableDistribution, MetaVariableDistributionTable and MetaVariableDistributionChart partition the supplied models into their IndependentEvolutions and can give insight into key transforms if there are many possible variable combinations which lead to quality models.
  • Implemented a MetaVariableModels function which will synthesize GPModels in terms of the DataVariables.
  • Implemented an AugmentData function which will append columns to the supplied dataMatrix based upon the MetaVariables option setting.
  • Implemented support for a SortBy option in UnivariatePlot. This can be either an index into the columns of the data matrix or one or more of the components of the supplied DataVariableLabels. This looks like it will be a very insightful augmentation for some data sets.
  • Implemented a new function, RangeLength, which returns Range[Length[x]]. Although simple, this utility function was requested by users due to the frequency of needing this behavior.
  • Implemented a Tooltip option for ResponsePlot, ResponseSurfacePlot and ResponsePlotExplorer to suppress the display of the reference values on the response curves. Unfortunately, although they are very useful, Mathematica's implementation of tooltips is very fragile and can cause the continuous reformatting of notebooks.
  • Fixed a bug in LabelString (and, by extension, LabelForm) where the NumberFormatting option was not being applied to real values within expressions.
  • Generalized CreateDataVariableNames to handle formatted inputs. Previously, it could also handle lists of formatted inputs so the documentation was also updated to reflect that capability.
  • Extended SymbolicRegression and SelectModels to handle combinations of the output of DriverVariables and DriverVariableCombinations as inputs to the ModelingVariables, AllowedVariables and RequiredVariables options. This form will also be valid for any of the many functions that implicitly use SelectModels.
  • Fixed a bug in SymbolicRegression wherein modeling would fail if non-numeric data was supplied and Rescale was enabled.
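
Since a metavariable is just a combination or transform of the original variables, appending metavariable columns to a data matrix is conceptually simple. The following Python sketch illustrates the idea only; augment_data and the callable-per-metavariable representation are assumptions for the example and do not reflect the AugmentData signature.

```python
def augment_data(data_matrix, metavariables):
    """Return a new matrix with one extra column per metavariable.

    data_matrix:   list of rows, each row a list of numbers
    metavariables: list of callables, each mapping a row to a single value
    The original rows are left unmodified.
    """
    return [list(row) + [mv(row) for mv in metavariables] for row in data_matrix]


data = [[1.0, 2.0], [3.0, 4.0]]
metavariables = [
    lambda row: row[0] * row[1],  # a combination of two variables
    lambda row: row[1] ** 2,      # a transform of a single variable
]

augmented = augment_data(data, metavariables)
print(augmented)
```

Supplying such columns up front means the model search does not have to rediscover the combination or transform on its own.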

Enjoy.

DataModeler Release 8.06 (26 January 2012)

Thursday, January 26, 2012

The theme of this release is a significantly enhanced DataOutlierTable. The DataOutlierTable function now allows DataRecordLabels to be displayed and includes other changes to improve the information display. We also changed the Input option to display the input variables to VariablesToPlot and added some flexibility and clarity to the input data display.

In addition, we attached a Tooltip to the titles of the VariableCombinationMap, VariableCombinationChart, and VariableCombinationTable showing the cumulative percentage of the total number of distinct combinations in the model set. Since there is a combinatorial explosion of possibilities when many input variables are being considered, this provides some context given that many variable combinations may not satisfy the SignificanceLevel threshold for display in the graphic.

Several bugs were fixed and a few other changes were made:

  • Fixed a bug in BivariatePlot wherein the warning messages issued when non-numeric data were supplied reported an incorrect number of affected columns.
  • Fixed a bug in UnivariatePlot wherein if (the default) GraphicsArrayColumns -> Automatic was being used, the number of data columns supplied rather than the number of numeric data columns would be used in calculating the layout.
  • Fixed a bug in ModelSelectionReport and ModelSelectionTable wherein supplying certain colors as the ColorFunction would cause the details of the Style construct to be displayed.
  • Fixed a bug in LabelString (and, by extension, LabelForm) wherein symbols and tooltips were not being handled properly if a list was supplied and the Joined -> True option was enabled. The new behavior is for the tooltip content to be stripped since a Tooltip cannot be converted into a form acceptable to StringJoin.
  • Increased the default NumericColumnThreshold options setting for SymbolicRegression to 0.75 (from 0.7). This means that any supplied data column must be at least 75% numeric to be included in the model development.
  • Extended VariablePresenceChart and VariablePresenceDistributionChart to accommodate the output of DriverVariableCombinations being supplied as the input to the VariablesToPlot option.
  • Modified MakeDataNumeric to allow a ReplacementFunction -> None setting. This just returns the originally supplied data structure. Additionally, we can specify None as part of a list; in this case the corresponding column would be returned unmodified.
  • Fixed a parsing bug in MergeInputResponseData wherein elements (non-lists) which were constructs (e.g., Π ) were not being recognized as being “atomic”.

Enjoy!